Table of Contents¶

Table of Contents

1. Introduction

2. Set up, import data, and prepare data

3. EDA- Time Series Data

3.1 Initial Exploration of Time Series Data Set

3.2 Descriptive Statistics of Time Series Data Frame

3.3 Finding the columns having the NULL values

3.4 Total Inside Sales , everything that isn't made at the store

3.5 Check for Missing Values

3.6 Combined Average Sales (Inside + Food Service) by Site ID

3.7 Average Diesel Sales by Site_ID

3.8 Average Unleaded Sales by Site_ID

3.9 Distribution of Sales (Inside + FoodService)

3.10 Distribution of Fuel Sales (Diesel + Unleaded)

3.11 Correlation matrix of all the 4 target variables

3.12 Average Inside Sales by Day of the Week

3.13 Average Food Service Sales by Day of the Week

3.14 Combined Average Sales by Day of the Week

3.15 Average Diesel & Unleaded sales by Day of the Week

3.16 Time Series Sliders

3.17 Weekly Sales (Inside + Food Service)

3.18 Monthly Aggregated Plot - Total inside sales and Food Service sales

4. EDA- Quantitative Data

4.1 Set up, import data, and prepare Quantitative data

4.2 Finding Columns having Null values

4.3 Imputing Missing Values

4.4 Finding variability of columns

4.5 Detecting Outliers

4.6 Separate variables into categorical and continuous

4.7 Visualizing Qualitative Features

4.8 Label encoding for converting Categorical variables to Numerical variables

5. Merging the time series dataset with the qualitative dataset

5.1 Correlation Matrices

5.2 Holiday vs Non-Holiday total daily fuel sales

6. First Year Sales

6.1 First Year Sales Analysis (Inside + Food Service)

6.2 First Year Fuel Sales Analysis (Diesel + Unleaded)

6.3 First Year Sales Visualizations

6.4 Correlation Matrices for First Year Sales

6.5 Distribution of First Year Sales

7. Top Performing store (First-Year non-fuel sales)

7.1 Time-Series Decomposition for the Top-performing store

7.2 Autocorrelation for Top Store

7.3 Stationarity test

7.4 Time Series Decomposition of Top 5 Stores with the highest First-Year Non-fuel Sales

8. Top Performing Store (First-Year Fuel Sales)

8.1 Time-Series Decomposition for the Top-performing store

8.2 Time Series Decomposition of Top 5 Stores with the highest First-Year Fuel Sales

9. Questions

10. Results

11. Contributions

11.1 Daryle Bilog

11.2 Joe Sarnello

11.3 Sanskriti Bhargava

11.4 Vinay Kumar Vascuri

1.Introduction¶

Maverik, a high-growth retail firm planning to open 30 new stores annually. Accurate sales forecasts for the first year of these new stores are crucial for effective financial planning and resource allocation. Leveraging both qualitative and time series data, we will employ various machine learning algorithms to predict daily sales metrics. Our objective is to achieve forecasts that closely match the actual Return on Investment (ROI) and provide accurate sales metrics. The project focuses on generating daily-level forecasts, incorporating qualitative insights and network-wide seasonality patterns, to support Maverik's strategic decision-making.

Questions¶

  • Which weekday provides the most total sales? Is there more total sales during the weekend?
  • What time of the year provides the most total sales? Is there more during the summer?
  • Do most stores sell diesel gas?
  • Does a stores local population or population's income effect total sales?
  • Of the sites in the dataset which ones provide the most total sales? What charecteristics do they have?
  • Are there any columns that has low to no variance or most commonly known as zero variance columns?
  • Will inner merging using the 'site_id_msba' from the two datasets be efficient on creating a better model?
  • What days has the highest total sales from the datasets for each revenue generating columns (Inside Sales, Food Services, Diesel Gallons, Unleaded Gallons)?

2. Set up, import data, and prepare data¶

In [ ]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
import plotly.subplots as sp
In [ ]:
from google.colab import drive
drive.mount('/content/drive')
Mounted at /content/drive
In [ ]:
import os
os.chdir('/content/drive/MyDrive/maverik-capstone') # changing the default directory

3. EDA -Time Series Data¶

3.1 Initial Exploration of Time Series Data Set¶

In [ ]:
df_tsdm = pd.read_csv("time_series_data_msba.csv")
df_tsdm.head(20)
Out[ ]:
Unnamed: 0 capital_projects.soft_opening_date calendar.calendar_day_date calendar.fiscal_week_id_for_year calendar.day_of_week calendar_information.holiday calendar_information.type_of_day daily_yoy_ndt.total_inside_sales daily_yoy_ndt.total_food_service diesel unleaded site_id_msba
0 1 6/14/2022 6/17/2022 25 Friday NONE WEEKDAY 2168.2920 861.6930 722.7745 1425.1020 24535
1 2 6/14/2022 6/22/2022 25 Wednesday NONE WEEKDAY 2051.5635 808.0275 730.4850 1436.2740 24535
2 3 6/14/2022 6/23/2022 25 Thursday NONE WEEKDAY 2257.5000 966.4410 895.7970 1594.3725 24535
3 4 6/14/2022 6/26/2022 26 Sunday NONE WEEKEND 1520.5925 542.3250 584.2900 1124.9280 24535
4 5 6/14/2022 6/27/2022 26 Monday NONE WEEKDAY 1897.6930 771.4525 852.2605 1640.2540 24535
5 6 6/14/2022 6/30/2022 26 Thursday NONE WEEKDAY 2377.8965 878.6750 966.6860 1890.6510 24535
6 7 6/14/2022 7/1/2022 27 Friday NONE WEEKDAY 2696.2250 943.3095 894.9325 2202.2665 24535
7 8 6/14/2022 7/4/2022 27 Monday Independence Day WEEKDAY 2100.8750 686.1155 539.8400 1337.3815 24535
8 9 6/14/2022 7/6/2022 27 Wednesday NONE WEEKDAY 2402.5610 872.0180 994.6825 1993.9115 24535
9 10 6/14/2022 7/8/2022 28 Friday NONE WEEKDAY 2777.7190 1004.1465 1231.6675 2461.2700 24535
10 11 6/14/2022 7/9/2022 28 Saturday NONE WEEKEND 2227.2495 797.3770 869.2880 1719.4450 24535
11 12 6/14/2022 7/10/2022 28 Sunday NONE WEEKEND 2045.0955 697.8090 658.9170 1600.5080 24535
12 13 6/14/2022 7/12/2022 28 Tuesday NONE WEEKDAY 2544.9095 1046.7135 960.0990 2080.7045 24535
13 14 6/14/2022 7/15/2022 29 Friday NONE WEEKDAY 3193.7290 1308.3945 1554.9415 2724.9740 24535
14 15 6/14/2022 7/20/2022 29 Wednesday NONE WEEKDAY 2983.2075 1164.9120 1411.6445 2652.4400 24535
15 16 6/14/2022 8/6/2022 32 Saturday NONE WEEKEND 2892.3580 977.9770 1060.6960 3037.0200 24535
16 17 6/14/2022 8/9/2022 32 Tuesday NONE WEEKDAY 3147.4065 1278.9525 1341.1475 3681.8880 24535
17 18 6/14/2022 8/13/2022 33 Saturday NONE WEEKEND 2893.6285 953.0290 914.5185 3233.2790 24535
18 19 6/14/2022 8/14/2022 33 Sunday NONE WEEKEND 2726.1780 940.7825 915.2605 2802.6320 24535
19 20 6/14/2022 8/16/2022 33 Tuesday NONE WEEKDAY 3282.8145 1138.9035 1924.7410 3928.3790 24535

3.2 Descriptive Statistics of Time Series Data Frame¶

In [ ]:
df_tsdm.describe()
Out[ ]:
Unnamed: 0 calendar.fiscal_week_id_for_year daily_yoy_ndt.total_inside_sales daily_yoy_ndt.total_food_service diesel unleaded site_id_msba
count 13908.000000 13908.000000 13908.000000 13908.000000 13908.000000 13908.000000 13908.000000
mean 6954.500000 26.501079 2846.537988 759.922326 1702.585227 2382.091588 23041.052632
std 4015.038107 14.998715 981.299870 341.578220 2161.208192 1025.518658 710.634218
min 1.000000 1.000000 0.000000 0.000000 0.000000 240.180500 21560.000000
25% 3477.750000 14.000000 2181.156250 521.087875 383.062750 1654.149000 22540.000000
50% 6954.500000 26.000000 2693.976250 697.434500 1018.920000 2256.677500 22907.500000
75% 10431.250000 39.000000 3325.306250 924.282625 2283.297625 2928.254000 23555.000000
max 13908.000000 52.000000 7172.466000 2531.662000 20853.952000 8077.233500 24535.000000
In [ ]:
df_tsdm.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13908 entries, 0 to 13907
Data columns (total 12 columns):
 #   Column                              Non-Null Count  Dtype  
---  ------                              --------------  -----  
 0   Unnamed: 0                          13908 non-null  int64  
 1   capital_projects.soft_opening_date  13908 non-null  object 
 2   calendar.calendar_day_date          13908 non-null  object 
 3   calendar.fiscal_week_id_for_year    13908 non-null  int64  
 4   calendar.day_of_week                13908 non-null  object 
 5   calendar_information.holiday        13908 non-null  object 
 6   calendar_information.type_of_day    13908 non-null  object 
 7   daily_yoy_ndt.total_inside_sales    13908 non-null  float64
 8   daily_yoy_ndt.total_food_service    13908 non-null  float64
 9   diesel                              13908 non-null  float64
 10  unleaded                            13908 non-null  float64
 11  site_id_msba                        13908 non-null  int64  
dtypes: float64(4), int64(3), object(5)
memory usage: 1.3+ MB

3.3 Finding the columns having the NULL values¶

In [ ]:
# Check for null values in the Time Series DataFrame
null_values_time_series = df_tsdm.isnull().sum()

# Display the columns with null values and their counts
print("Null Values in Time Series DataFrame:")
print(null_values_time_series[null_values_time_series > 0])
Null Values in Time Series DataFrame:
Series([], dtype: int64)

3.4 Total Inside Sales , everything that isn't made at the store¶

In [ ]:
# Group the data by 'site_id' and calculate the maximum and minimum inside sales
inside_sales_summary = df_tsdm.groupby('site_id_msba')['daily_yoy_ndt.total_inside_sales'].agg(['max', 'min'])

# Reset the index to make 'site_id' a column instead of an index
inside_sales_summary = inside_sales_summary.reset_index()

# Rename the columns for clarity
inside_sales_summary.columns = ['site_id', 'max_inside_sales', 'min_inside_sales']

# Display the summary
print(inside_sales_summary)
    site_id  max_inside_sales  min_inside_sales
0     21560         5975.6060         1339.6110
1     21980         6191.9060         1094.6740
2     22015         4871.8845         1170.4525
3     22085         7172.4660         1454.4460
4     22120         3849.2125          693.1925
5     22260         4692.3625         1207.9620
6     22330         3690.6800          165.7285
7     22400         4055.4920            0.0000
8     22505         3328.3390          449.0150
9     22540         4663.4350         1002.6240
10    22575         3794.0000          867.7970
11    22645         3346.2660          968.9645
12    22680         5710.0855         1131.3715
13    22715         4399.4545          825.9055
14    22750         6579.7865         1939.8330
15    22785         4282.6840         1740.1405
16    22820         3879.5400         1069.2465
17    22855         4811.7370         1165.8115
18    22890         4269.7655          655.8300
19    22925         3913.6825         1442.9975
20    23065         4264.6415          973.5880
21    23135         3394.3945         1006.6350
22    23240         4309.8160          886.3225
23    23345         5741.7500         1250.8090
24    23380         5895.5750         1553.1915
25    23415         6061.3595          524.1810
26    23450         5153.3440         1157.4745
27    23485         3021.8615          752.6085
28    23555         3753.8830          917.3535
29    23660         5943.5950           16.7545
30    23730         5514.4670         1686.8880
31    23765         4375.6615         1581.8320
32    23835         3860.3180         1340.1150
33    23905         3566.1675          821.5935
34    24150         5578.0900          844.8195
35    24220         5449.6400         1689.1385
36    24255         4558.6765          970.3645
37    24535         4405.1665         1257.7530
In [ ]:
daily_sales_max = df_tsdm.groupby('site_id_msba')['daily_yoy_ndt.total_inside_sales'].max()
In [ ]:
# Calculate the maximum daily inside sales for each site_id
daily_sales_max = df_tsdm.groupby('site_id_msba')['daily_yoy_ndt.total_inside_sales'].max()

# Find the site_id with the highest and lowest maximum daily inside sales
site_id_highest_sales = daily_sales_max.idxmax()
site_id_lowest_sales = daily_sales_max.idxmin()

# Display the results
print(f"Site ID with the highest maximum daily inside sales: {site_id_highest_sales}")
print(f"Site ID with the lowest maximum daily inside sales: {site_id_lowest_sales}")
Site ID with the highest maximum daily inside sales: 22085
Site ID with the lowest maximum daily inside sales: 23485
In [ ]:
df_tsdm.describe()
Out[ ]:
Unnamed: 0 calendar.fiscal_week_id_for_year daily_yoy_ndt.total_inside_sales daily_yoy_ndt.total_food_service diesel unleaded site_id_msba
count 13908.000000 13908.000000 13908.000000 13908.000000 13908.000000 13908.000000 13908.000000
mean 6954.500000 26.501079 2846.537988 759.922326 1702.585227 2382.091588 23041.052632
std 4015.038107 14.998715 981.299870 341.578220 2161.208192 1025.518658 710.634218
min 1.000000 1.000000 0.000000 0.000000 0.000000 240.180500 21560.000000
25% 3477.750000 14.000000 2181.156250 521.087875 383.062750 1654.149000 22540.000000
50% 6954.500000 26.000000 2693.976250 697.434500 1018.920000 2256.677500 22907.500000
75% 10431.250000 39.000000 3325.306250 924.282625 2283.297625 2928.254000 23555.000000
max 13908.000000 52.000000 7172.466000 2531.662000 20853.952000 8077.233500 24535.000000

3.5 Check for Missing values¶

In [ ]:
# Check for missing values in the entire DataFrame
null_values = df_tsdm.isnull().sum()

#Calculate the percentage of missing values
total_values = df_tsdm.shape[0]  # Total number of rows
percentage_missing = (null_values / total_values) * 100

# Create a DataFrame to display the results
missing_data_info = pd.DataFrame({
    'Column': null_values.index,
    'Missing Values': null_values.values,
    'Percentage Missing': percentage_missing.values
})

# Filter for columns with missing values
columns_with_missing = missing_data_info[missing_data_info['Missing Values'] > 0]

# Display columns with missing values
print("Columns with Missing Values:")
print(columns_with_missing)
Columns with Missing Values:
Empty DataFrame
Columns: [Column, Missing Values, Percentage Missing]
Index: []

3.6 Combined Average Sales (Inside + Food Service) by Site ID¶

In [ ]:
# Group data by 'site_id_msba' and calculate the average inside sales and food service sales for each site_id
average_inside_sales = df_tsdm.groupby('site_id_msba')['daily_yoy_ndt.total_inside_sales'].mean()
average_food_service_sales = df_tsdm.groupby('site_id_msba')['daily_yoy_ndt.total_food_service'].mean()

# Combine the average values by summing them
combined_average_sales = average_inside_sales + average_food_service_sales

# Sort the data in descending order of combined average sales
combined_average_sales = combined_average_sales.sort_values(ascending=False)

# Define a color palette for the bars
colors = sns.color_palette('colorblind', len(combined_average_sales))

# Create a bar plot for combined average sales by site_id with different colors
plt.figure(figsize=(15, 6))
ax = combined_average_sales.plot(kind='bar', color=colors)
plt.xlabel('Site ID')
plt.ylabel('Combined Average Sales')
plt.title('Combined Average Sales (Inside + Food Service) by Site ID (Descending Order)')
plt.xticks(rotation=90)  # Keep x-axis labels horizontal

# Add y-axis values with some space at the top of each bar
y_offset = 50  # Adjust this value to control the spacing
for i, v in enumerate(combined_average_sales):
    ax.text(i, v + y_offset, str(round(v, 2)), ha='center', va='bottom', fontsize=9, rotation='vertical')

# Adjust y-axis limits to provide space at the top
plt.ylim(bottom=0, top=max(combined_average_sales) + y_offset * 2)

plt.tight_layout()

# Show the combined bar plot
plt.show()

Site ID 22085 has the Highest Combined Sales of all the sites where as Site ID 23485 has the lowest combined sales. Top 5 site_ids having highest sales are 22085, 22750, 21980, 23415, 23660

3.7 Average Diesel Sales by Site_ID¶

In [ ]:
# Group data by 'site_id_msba' and calculate the average diesel sales for each site_id
average_diesel_sales = df_tsdm.groupby('site_id_msba')['diesel'].mean()

# Sort the data in descending order of average diesel sales
average_diesel_sales = average_diesel_sales.sort_values(ascending=False)

# Define a color palette for the bars
colors = sns.color_palette('colorblind', len(average_diesel_sales))

# Create a bar plot for average diesel sales by site_id with different colors
plt.figure(figsize=(15, 6))
ax = average_diesel_sales.plot(kind='bar', color=colors)
plt.xlabel('Site ID')
plt.ylabel('Average Diesel Sales')
plt.title('Average Diesel Sales by Site ID (Descending Order)')
plt.xticks(rotation=90)  # Keep x-axis labels horizontal

# Add y-axis values with some space at the top of each bar
y_offset = 50  # Adjust this value to control the spacing
for i, v in enumerate(average_diesel_sales):
    ax.text(i, v + y_offset, str(round(v, 2)), ha='center', va='bottom', fontsize=9, rotation='vertical')

# Adjust y-axis limits to provide space at the top
plt.ylim(bottom=0, top=max(average_diesel_sales) + y_offset * 2)

plt.tight_layout()

# Show the bar plot for diesel sales
plt.show()

21980 has the highest Diesel Sales

3.8 Average Unleaded Sales by Site_ID¶

In [ ]:
# Group data by 'site_id_msba' and calculate the average unleaded sales for each site_id
average_unleaded_sales = df_tsdm.groupby('site_id_msba')['unleaded'].mean()

# Sort the data in descending order of average unleaded sales
average_unleaded_sales = average_unleaded_sales.sort_values(ascending=False)

# Define a color palette for the bars
colors = sns.color_palette('colorblind', len(average_unleaded_sales))

# Create a bar plot for average unleaded sales by site_id with different colors
plt.figure(figsize=(15, 6))
ax = average_unleaded_sales.plot(kind='bar', color=colors)
plt.xlabel('Site ID')
plt.ylabel('Average Unleaded Sales')
plt.title('Average Unleaded Sales by Site ID (Descending Order)')
plt.xticks(rotation=90)  # Keep x-axis labels horizontal

# Add y-axis values with some space at the top of each bar
y_offset = 50  # Adjust this value to control the spacing
for i, v in enumerate(average_unleaded_sales):
    ax.text(i, v + y_offset, str(round(v, 2)), ha='center', va='bottom', fontsize=9, rotation='vertical')

# Adjust y-axis limits to provide space at the top
plt.ylim(bottom=0, top=max(average_unleaded_sales) + y_offset * 2)

plt.tight_layout()

# Show the bar plot for unleaded sales
plt.show()

3.9 Distribution of Sales (Inside + FoodService)¶

In [ ]:
sns.set(style="whitegrid")
sales_columns = ['daily_yoy_ndt.total_inside_sales', 'daily_yoy_ndt.total_food_service']
df_tsdm['total_daily_sales'] = df_tsdm[sales_columns].sum(axis=1)

fig, axes = plt.subplots(3,1, figsize=(18, 12))
fig.suptitle('Histograms of Sales-Related Features')

for i, col in enumerate(sales_columns + ['total_daily_sales']):
    sns.histplot(df_tsdm[col], bins=50, kde=True, ax=axes[i])
    axes[i].set_title(f'Distribution of {col}')
    axes[i].set_xlabel(col)
    axes[i].set_ylabel('Frequency')

plt.tight_layout(rect=[0, 0, 1, 0.96])
plt.show()

Daily Inside Sales: Appears to be right-skewed, with most stores having lower daily inside sales.

Daily Food Service: This is also right-skewed; the majority of stores have lower daily food service sales.

Total Sales: The aggregated sales show a right-skewed distribution.

These skewnesses suggest that outlier detection and possibly transformation may be necessary for modeling.

3.10 Distribution of Fuel Sales (Diesel + Unleaded)¶

In [ ]:
fuel_features = ['diesel', 'unleaded']
df_tsdm['total_daily_fuel_sales'] = df_tsdm[fuel_features].sum(axis=1)
fig, axes = plt.subplots(1, 3, figsize=(18, 6))
fig.suptitle('Distribution of Fuel-related sales')

for i, col in enumerate(fuel_features + ['total_daily_fuel_sales']):
    sns.histplot(df_tsdm[col], bins=50, kde=True, ax=axes[i])
    axes[i].set_title(f'Distribution of {col}')
    axes[i].set_xlabel(col + '(in Gallons)')
    axes[i].set_ylabel('Frequency')

plt.tight_layout(rect=[0, 0, 1, 0.96])
plt.show()

Diesel: Exhibits a right-skewed distribution similar to the previous variables.

Unleaded: Slightly right-skewed but relatively more evenly distributed compared to other variables.

Total Fuel Sales: Exhibits right-skewed distribution which is inline with diesel and unleaded sales

3.11 Correlation matrix of all the 4 target variables¶

In [ ]:
variables = ['daily_yoy_ndt.total_inside_sales', 'daily_yoy_ndt.total_food_service', 'diesel', 'unleaded']
correlation_matrix = df_tsdm[variables].corr()

# Create a heatmap
plt.figure(figsize=(8, 6))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', linewidths=0.5)
plt.title('Correlation Heatmap')
plt.show()

3.12 Average Inside Sales by Day of the Week¶

In [ ]:
# Convert 'calendar.calendar_day_date' to a datetime object if it's not already
df_tsdm['calendar.calendar_day_date'] = pd.to_datetime(df_tsdm['calendar.calendar_day_date'])

# Extract the day of the week as an integer (0=Monday, 1=Tuesday, ..., 6=Sunday)
df_tsdm['calendar.day_of_week'] = df_tsdm['calendar.calendar_day_date'].dt.weekday

# Group the data by 'calendar.day_of_week' and calculate the average inside sales for each day of the week
average_sales_by_day_of_week = df_tsdm.groupby('calendar.day_of_week')['daily_yoy_ndt.total_inside_sales'].mean()

# Define the names of days of the week
days_of_week = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']

# Sort the days of the week in descending order of average sales
sorted_days_of_week = [day for _, day in sorted(zip(average_sales_by_day_of_week, days_of_week), reverse=True)]
sorted_sales = sorted(average_sales_by_day_of_week, reverse=True)

# Create a list of unique colors for each day
unique_colors = plt.cm.viridis(np.linspace(0, 1, len(sorted_days_of_week)))

# Create a bar plot with different colors for each day and sorted in descending order
plt.figure(figsize=(10, 6))
bars = plt.bar(sorted_days_of_week, sorted_sales, color=unique_colors)
plt.title('Average Inside Sales by Day of the Week (Descending Order)')
plt.xlabel('Day of the Week')
plt.ylabel('Average Inside Sales')

# Annotate each bar with its respective y-axis value
for bar, value in zip(bars, sorted_sales):
    plt.text(bar.get_x() + bar.get_width() / 2 - 0.15, bar.get_height() + 50, str(round(value, 2)), ha='center', color='black', fontsize=10)

plt.tight_layout()
plt.show()

Friday is the day of the week where the Average Inside sales are highest. We can see a gradual increase in average sales from Monday to Friday with the sales dipping on the weekends Saturday and Sunday.

3.13 Average Food Service Sales by Day of the Week¶

In [ ]:
# Convert 'calendar.calendar_day_date' to a datetime object if it's not already
df_tsdm['calendar.calendar_day_date'] = pd.to_datetime(df_tsdm['calendar.calendar_day_date'])

# Extract the day of the week as an integer (0=Monday, 1=Tuesday, ..., 6=Sunday)
df_tsdm['calendar.day_of_week'] = df_tsdm['calendar.calendar_day_date'].dt.weekday

# Group the data by 'calendar.day_of_week' and calculate the average food service sales for each day of the week
average_food_sales_by_day_of_week = df_tsdm.groupby('calendar.day_of_week')['daily_yoy_ndt.total_food_service'].mean()

# Define the names of days of the week
days_of_week = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']

# Sort the days of the week in descending order of average food service sales
sorted_days_of_week = [day for _, day in sorted(zip(average_food_sales_by_day_of_week, days_of_week), reverse=True)]
sorted_food_sales = sorted(average_food_sales_by_day_of_week, reverse=True)

# Create a list of unique colors for each day
unique_colors = plt.cm.viridis(np.linspace(0, 1, len(sorted_days_of_week)))

# Create a bar plot with different colors for each day and sorted in descending order
plt.figure(figsize=(10, 6))
bars = plt.bar(sorted_days_of_week, sorted_food_sales, color=unique_colors)
plt.title('Average Food Service Sales by Day of the Week (Descending Order)')
plt.xlabel('Day of the Week')
plt.ylabel('Average Food Service Sales')

# Annotate each bar with its respective y-axis value
for bar, value in zip(bars, sorted_food_sales):
    plt.text(bar.get_x() + bar.get_width() / 2 - 0.15, bar.get_height() + 20, str(round(value, 2)), ha='center', color='black', fontsize=10)

plt.tight_layout()
plt.show()

Friday is the day of the week where the Average food sales are highest

3.14 Combined Average Sales by Day of the Week¶

In [ ]:
# Convert 'calendar.calendar_day_date' to a datetime object if it's not already
df_tsdm['calendar.calendar_day_date'] = pd.to_datetime(df_tsdm['calendar.calendar_day_date'])

# Extract the day of the week as an integer (0=Monday, 1=Tuesday, ..., 6=Sunday)
df_tsdm['calendar.day_of_week'] = df_tsdm['calendar.calendar_day_date'].dt.weekday

# Define the names of days of the week starting from Monday
days_of_week = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']

# Group the data by 'calendar.day_of_week' and calculate the average inside sales and food service sales for each day of the week
average_inside_sales_by_day_of_week = df_tsdm.groupby('calendar.day_of_week')['daily_yoy_ndt.total_inside_sales'].mean()
average_food_sales_by_day_of_week = df_tsdm.groupby('calendar.day_of_week')['daily_yoy_ndt.total_food_service'].mean()

# Combine inside sales and food service sales for each day of the week
combined_sales = average_inside_sales_by_day_of_week + average_food_sales_by_day_of_week

# Create a list of unique colors for each day
unique_colors = plt.cm.viridis(np.linspace(0, 1, len(days_of_week)))

# Create a bar plot with different colors for each day
plt.figure(figsize=(12, 6))
bars = plt.bar(days_of_week, combined_sales, color=unique_colors)
plt.title('Combined Average Sales by Day of the Week (Starting from Monday)')
plt.xlabel('Day of the Week')
plt.ylabel('Combined Average Sales')

# Annotate each bar with its respective y-axis value
for bar, value in zip(bars, combined_sales):
    plt.text(bar.get_x() + bar.get_width() / 2 - 0.15, bar.get_height() + 50, str(round(value, 2)), ha='center', color='black', fontsize=10)

plt.tight_layout()
plt.show()

Friday Records the Highest sales. There is a gradual increase from Monday to Friday and then a dip on Weekends

3.15 Average Diesel & Unleaded sales by Day of the Week¶

In [ ]:
# Convert 'calendar.calendar_day_date' to a datetime object if it's not already
df_tsdm['calendar.calendar_day_date'] = pd.to_datetime(df_tsdm['calendar.calendar_day_date'])

# Extract the day of the week as an integer (0=Monday, 1=Tuesday, ..., 6=Sunday)
df_tsdm['calendar.day_of_week'] = df_tsdm['calendar.calendar_day_date'].dt.weekday

# Group the data by 'calendar.day_of_week' and calculate the average diesel and unleaded sales for each day of the week
average_diesel_sales_by_day_of_week = df_tsdm.groupby('calendar.day_of_week')['diesel'].mean()
average_unleaded_sales_by_day_of_week = df_tsdm.groupby('calendar.day_of_week')['unleaded'].mean()

# Define the names of days of the week
days_of_week = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']

# Sort the days of the week in the order they appear (no sorting needed)
sorted_days_of_week = days_of_week

# Create a list of unique colors for each day
unique_colors = plt.cm.viridis(np.linspace(0, 1, len(sorted_days_of_week)))

# Create separate bar plots for diesel and unleaded sales, using the same x-axis and y-axis
plt.figure(figsize=(10, 6))

# Plot Diesel Sales
plt.bar(sorted_days_of_week, average_diesel_sales_by_day_of_week, color=unique_colors, label='Diesel')
plt.title('Average Diesel Sales by Day of the Week')
plt.xlabel('Day of the Week')
plt.ylabel('Average Diesel Sales')
plt.legend()
plt.tight_layout()

# Show the first bar plot
plt.show()

# Create a new figure for Unleaded Sales
plt.figure(figsize=(10, 6))

# Plot Unleaded Sales
plt.bar(sorted_days_of_week, average_unleaded_sales_by_day_of_week, color=unique_colors, label='Unleaded')
plt.title('Average Unleaded Sales by Day of the Week')
plt.xlabel('Day of the Week')
plt.ylabel('Average Unleaded Sales')
plt.legend()
plt.tight_layout()

# Show the second bar plot
plt.show()

Diesel sales are highest on Wednesday and Unleaded sales are highest on Friday

3.16 Time Series Sliders¶

In [ ]:
# Extract month and year from the date column
df_tsdm['Date'] = pd.to_datetime(df_tsdm['calendar.calendar_day_date'])

# Create a time series slider chart for daily sales
fig = px.line(df_tsdm, x='Date', y=[ 'diesel', 'unleaded'],
              labels={'Date': 'Date', 'value': 'Sales'},
              title='Time Series Slider Chart of Daily Sales- Diesel & Unleaded',
              width=1200, height=600)

# Add a slider for selecting the date range
fig.update_xaxes(rangeslider_visible=True)

# Show the plot
fig.show()

Time Series slider chart of Fuel Sales

In [ ]:
# Extract date, year, and month from the date column
df_tsdm['Date'] = pd.to_datetime(df_tsdm['calendar.calendar_day_date'])
df_tsdm['Year'] = df_tsdm['Date'].dt.year
df_tsdm['Month'] = df_tsdm['Date'].dt.strftime('%Y-%m-%d')

# Group the data by date and calculate the sum of sales columns
grouped = df_tsdm.groupby(['Year', 'Month'])[['daily_yoy_ndt.total_inside_sales', 'daily_yoy_ndt.total_food_service', 'diesel', 'unleaded']].sum().reset_index()

# Create a vertical bar chart for daily sales
fig1 = go.Figure(data=[
    go.Bar(name='Inside Sales', x=grouped['Month'], y=grouped['daily_yoy_ndt.total_inside_sales'], marker_color='blue'),
    go.Bar(name='Food Service', x=grouped['Month'], y=grouped['daily_yoy_ndt.total_food_service'], marker_color='green'),
    go.Bar(name='Diesel', x=grouped['Month'], y=grouped['diesel'], marker_color='orange'),
    go.Bar(name='Unleaded', x=grouped['Month'], y=grouped['unleaded'], marker_color='red')
])

# Add a time series slider chart for daily sales
fig2 = px.line(grouped, x='Month', y=['daily_yoy_ndt.total_inside_sales', 'daily_yoy_ndt.total_food_service', 'diesel', 'unleaded'],
              labels={'Month': 'Date', 'value': 'Sales'},
              title='Time Series Slider Chart of Daily Sales',
              width=800, height=400, range_x=[grouped['Month'].min(), grouped['Month'].max()])

# Combine the two figures into subplots
fig = sp.make_subplots(rows=2, cols=1, shared_xaxes=True)

# Add the vertical bar chart to the first subplot
fig.add_trace(fig1.data[0], row=1, col=1)
fig.add_trace(fig1.data[1], row=1, col=1)
fig.add_trace(fig1.data[2], row=1, col=1)
fig.add_trace(fig1.data[3], row=1, col=1)

# Add the time series slider chart to the second subplot
fig.add_trace(fig2.data[0], row=2, col=1)
fig.add_trace(fig2.data[1], row=2, col=1)
fig.add_trace(fig2.data[2], row=2, col=1)
fig.add_trace(fig2.data[3], row=2, col=1)

# Update the layout of the combined figure
fig.update_layout(
    title='Combined Vertical Bar Chart and Time Series Slider Chart',
    xaxis_title='Date',
    yaxis_title='Sales',
    xaxis2=dict(rangeslider=dict(visible=True)),
    yaxis2=dict(title='Sales'),
)

# Show the combined figure
fig.show()

Time Series slider chart of all Target Variables

3.17 Weekly Sales (Inside + Food Service)¶

Total Inside Sales - each week

In [ ]:
# Convert the 'calendar.calendar_day_date' column to a datetime format
df_tsdm['calendar.calendar_day_date'] = pd.to_datetime(df_tsdm['calendar.calendar_day_date'])

# Extract the fiscal year and fiscal week from the date column
df_tsdm['Fiscal_Year'] = df_tsdm['calendar.calendar_day_date'].dt.year
df_tsdm['Fiscal_Week'] = df_tsdm['calendar.calendar_day_date'].dt.strftime('%U')

# Group the data by fiscal year and fiscal week, then calculate the total inside sales for each week
weekly_inside_sales = df_tsdm.groupby(['Fiscal_Year', 'Fiscal_Week'])['daily_yoy_ndt.total_inside_sales'].sum()

# Create a line plot of total inside sales by fiscal week
plt.figure(figsize=(12, 6))

# Specify custom colors for each year
colors = {
    2021: 'blue',
    2022: 'green',
    2023: 'red'
}

# Plot the data for each fiscal year separately with different colors
for year in df_tsdm['Fiscal_Year'].unique():
    data = weekly_inside_sales[year].reset_index()
    plt.plot(data['Fiscal_Week'], data['daily_yoy_ndt.total_inside_sales'], marker='o', linestyle='-', label=f'Year {year}', color=colors[year])

plt.title('Total Inside Sales by Fiscal Week')
plt.xlabel('Fiscal Week')
plt.ylabel('Total Inside Sales')
plt.legend()
plt.grid(True)

plt.tight_layout()
plt.show()

For year 2021, Week 51 has the highest Inside Sales For year 2022, Week 19 has the highest Inside Sales For year 2023, Week 17 has the highest Inside Sales

Total Food Service Sales - each week

In [ ]:
# Convert the 'calendar.calendar_day_date' column to a datetime format
df_tsdm['calendar.calendar_day_date'] = pd.to_datetime(df_tsdm['calendar.calendar_day_date'])

# Extract the fiscal year and fiscal week from the date column
df_tsdm['Fiscal_Year'] = df_tsdm['calendar.calendar_day_date'].dt.year
df_tsdm['Fiscal_Week'] = df_tsdm['calendar.calendar_day_date'].dt.strftime('%U')

# Group the data by fiscal year and fiscal week, then calculate the total food service sales for each week
weekly_food_service_sales = df_tsdm.groupby(['Fiscal_Year', 'Fiscal_Week'])['daily_yoy_ndt.total_food_service'].sum()

# Create a line plot of total food service sales by fiscal week
plt.figure(figsize=(12, 6))

# Specify custom colors for each year
colors = {2021: 'blue', 2022: 'green', 2023: 'red'}

# Plot the data for each fiscal year separately
for year in df_tsdm['Fiscal_Year'].unique():
    data = weekly_food_service_sales[year].reset_index()
    plt.plot(data['Fiscal_Week'], data['daily_yoy_ndt.total_food_service'], marker='o', linestyle='-', label=f'Year {year}', color=colors[year])

plt.title('Total Food Service Sales by Fiscal Week')
plt.xlabel('Fiscal Week')
plt.ylabel('Total Food Service Sales')
plt.legend()
plt.grid(True)


plt.tight_layout()
plt.show()

For year 2021, Week 50 has the highest Food Service Sales For year 2022, Week 32 has the highest Food Service Sales For year 2023, Week 17 has the highest Food Service Sales

Total Combined Sales - each week

In [ ]:
# Convert the 'calendar.calendar_day_date' column to a datetime format
df_tsdm['calendar.calendar_day_date'] = pd.to_datetime(df_tsdm['calendar.calendar_day_date'])

# Extract the fiscal year and fiscal week from the date column
df_tsdm['Fiscal_Year'] = df_tsdm['calendar.calendar_day_date'].dt.year
df_tsdm['Fiscal_Week'] = df_tsdm['calendar.calendar_day_date'].dt.strftime('%U')

# Group the data by fiscal year and fiscal week, then calculate the total sales for each week
weekly_total_sales = df_tsdm.groupby(['Fiscal_Year', 'Fiscal_Week'])[['daily_yoy_ndt.total_inside_sales', 'daily_yoy_ndt.total_food_service']].sum()

# Create a line plot for each fiscal year with specified colors
plt.figure(figsize=(12, 6))

colors = ['b', 'g', 'r']  # Specify custom colors for each year
x_values = []  # Initialize x_values list to accumulate x values

for i, year in enumerate(weekly_total_sales.index.levels[0]):
    data_year = weekly_total_sales.loc[year]
    data_year['Combined_Total_Sales'] = data_year['daily_yoy_ndt.total_inside_sales'] + data_year['daily_yoy_ndt.total_food_service']

    # Plot the combined total sales for the year with the specified color
    x_year = list(range(len(data_year)))
    plt.plot(x_year, data_year['Combined_Total_Sales'], marker='o', linestyle='-', label=f'Year {year}', color=colors[i])

    # Accumulate x values
    x_values += x_year

plt.title('Combined Total Sales by Fiscal Week (Each Year Separately)')
plt.xlabel('Week Number')
plt.ylabel('Combined Total Sales')
plt.legend()
plt.grid(True)

# Set the tick locations
plt.xticks(x_values)

plt.tight_layout()
plt.show()

For year 2021, Week 48 has the highest Combined Sales For year 2022, Week 19 has the highest Combined Sales For year 2023, Week 16 has the highest Combined Sales

In [ ]:
# Convert the 'calendar.calendar_day_date' column to datetime if it's not already in datetime format
df_tsdm['calendar.calendar_day_date'] = pd.to_datetime(df_tsdm['calendar.calendar_day_date'])

# Create date-related features
df_tsdm['year'] = df_tsdm['calendar.calendar_day_date'].dt.year
df_tsdm['month'] = df_tsdm['calendar.calendar_day_date'].dt.month
df_tsdm['day'] = df_tsdm['calendar.calendar_day_date'].dt.day
df_tsdm['quarter'] = df_tsdm['calendar.calendar_day_date'].dt.quarter
df_tsdm['day_of_week'] = df_tsdm['calendar.calendar_day_date'].dt.dayofweek  # 0=Monday, 1=Tuesday, ..., 6=Sunday
df_tsdm['is_weekend'] = (df_tsdm['day_of_week'] >= 5).astype(int)  # 1 if it's a weekend, 0 otherwise

# Print the modified DataFrame with new features
print(df_tsdm.head())
   Unnamed: 0 capital_projects.soft_opening_date calendar.calendar_day_date  \
0           1                          6/14/2022                 2022-06-17   
1           2                          6/14/2022                 2022-06-22   
2           3                          6/14/2022                 2022-06-23   
3           4                          6/14/2022                 2022-06-26   
4           5                          6/14/2022                 2022-06-27   

   calendar.fiscal_week_id_for_year  calendar.day_of_week  \
0                                25                     4   
1                                25                     2   
2                                25                     3   
3                                26                     6   
4                                26                     0   

  calendar_information.holiday calendar_information.type_of_day  \
0                         NONE                          WEEKDAY   
1                         NONE                          WEEKDAY   
2                         NONE                          WEEKDAY   
3                         NONE                          WEEKEND   
4                         NONE                          WEEKDAY   

   daily_yoy_ndt.total_inside_sales  daily_yoy_ndt.total_food_service  \
0                         2168.2920                          861.6930   
1                         2051.5635                          808.0275   
2                         2257.5000                          966.4410   
3                         1520.5925                          542.3250   
4                         1897.6930                          771.4525   

     diesel  ...  Year       Month  Fiscal_Year  Fiscal_Week  year  month day  \
0  722.7745  ...  2022  2022-06-17         2022           24  2022      6  17   
1  730.4850  ...  2022  2022-06-22         2022           25  2022      6  22   
2  895.7970  ...  2022  2022-06-23         2022           25  2022      6  23   
3  584.2900  ...  2022  2022-06-26         2022           26  2022      6  26   
4  852.2605  ...  2022  2022-06-27         2022           26  2022      6  27   

   quarter day_of_week  is_weekend  
0        2           4           0  
1        2           2           0  
2        2           3           0  
3        2           6           1  
4        2           0           0  

[5 rows x 25 columns]

3.18 Monthly Aggregated Plot - Total inside sales and Food Service sales¶

In [ ]:
#Create a stacked bar chart for monthly total inside sales and total food service sales in millions of USD
monthly_sales = df_tsdm.groupby(['year', 'month'])[['daily_yoy_ndt.total_food_service','daily_yoy_ndt.total_inside_sales']].sum() / 1_000_000  # Divide by 1 million
plt.figure(figsize=(12, 6))

# Specify custom colors for the bars
colors = ['green', 'orange']

ax = monthly_sales.plot(kind='bar', stacked=True, color=colors, figsize=(12, 6))
plt.title('Monthly Sales (Stacked)')
plt.xlabel('Month and Year')
plt.ylabel('Sales (in millions USD)')
plt.grid(axis='y')

# Annotate data values on top of each bar segment with exact coordinates
for p in ax.patches:
    width, height = p.get_width(), p.get_height()
    x, y = p.get_xy()
    if height > 0.2:  # Only annotate bars with significant height
        ax.annotate(f'{height:.2f}M', (x + width/2, y + height + 0.02), ha='center', va='center', fontsize=10, color='black')

# Set custom x-labels using month and year values from the DataFrame
custom_labels = [f'{int(x[1])}, {int(x[0])}' for x in monthly_sales.index]
ax.set_xticks(range(len(custom_labels)))
ax.set_xticklabels(custom_labels, rotation=45, ha='right')

plt.legend(['Inside Sales', 'Food Service Sales'], loc='upper right')
plt.tight_layout()
plt.show()
<Figure size 1200x600 with 0 Axes>

May 2022 has the highest Combined Sales

Seasonal Plot: Plot seasonal patterns by aggregating data over seasons or quarters.

In [ ]:
#Plot quarterly total inside sales
quarterly_sales = df_tsdm.groupby(['year', 'quarter'])['daily_yoy_ndt.total_inside_sales'].sum()
plt.figure(figsize=(12, 6))
quarterly_sales.plot(kind='bar', rot=0)
plt.title('Quarterly Total Inside Sales')
plt.xlabel('Quarter')
plt.ylabel('Total Inside Sales')
plt.grid(axis='y')
plt.show()

During 2nd,3rd quarters of Year 2022, sales were found to be highest

Box Plot by Day of the Week: Visualize how a feature varies by the day of the week.

In [ ]:
#Box plot of inside sales by day of the week
plt.figure(figsize=(10, 6))
plt.boxplot([df_tsdm[df_tsdm['day_of_week'] == i]['daily_yoy_ndt.total_inside_sales'] for i in range(7)],
            labels=['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun'])
plt.title('Inside Sales by Day of the Week')
plt.xlabel('Day of the Week')
plt.ylabel('Inside Sales')
plt.grid(axis='y')
plt.show()

Combine Sales of Total Inside and Total Food Service

In [ ]:
# Group the data by year and calculate the sum of inside sales and food service sales for each year
yearly_sales = df_tsdm.groupby(df_tsdm['calendar.calendar_day_date'].dt.year)[['daily_yoy_ndt.total_inside_sales', 'daily_yoy_ndt.total_food_service']].sum()

# Create a new column for the combined sales
yearly_sales['total_combined_sales'] = yearly_sales['daily_yoy_ndt.total_inside_sales'] + yearly_sales['daily_yoy_ndt.total_food_service']

# Extract years and sales values
years = yearly_sales.index
sales_values = yearly_sales['total_combined_sales']

# Create a bar plot with different colors
colors = ['blue', 'green', 'red']
plt.figure(figsize=(12, 6))
bars = plt.bar(years, sales_values, color=colors)

# Add truncated values on top of each bar
for bar, value in zip(bars, sales_values):
    truncated_value = round(value, 1)  # Truncate to 1 decimal place
    plt.text(bar.get_x() + bar.get_width() / 2, value, f'{truncated_value:.1f}', ha='center', va='bottom', fontsize=12)

plt.title('Combined Sales by Year')
plt.xlabel('Year')
plt.ylabel('Total Combined Sales')
plt.xticks(years)
plt.grid(axis='y')
plt.show()

Frequency Distribution of Day Types¶

In [ ]:
# Define a color palette for day types
day_type_colors = {'WEEKDAY': 'ORANGE', 'WEEKEND': 'RED'}

# Frequency distribution of day types with different colors
day_type_counts = df_tsdm['calendar_information.type_of_day'].value_counts()
total_days = len(df_tsdm)  # Total number of days

plt.figure(figsize=(8, 6))
ax = sns.barplot(x=day_type_counts.index, y=day_type_counts.values, palette=day_type_colors)

# Add count annotations above the bars
for p in ax.patches:
    ax.annotate(f'{p.get_height()} ({(p.get_height() / total_days) * 100:.1f}%)',
                (p.get_x() + p.get_width() / 2., p.get_height()), ha='center', va='center', fontsize=12, color='black', xytext=(0, 10),
                textcoords='offset points')

plt.title('Frequency Distribution of Day Types')
plt.xlabel('Day Type')
plt.ylabel('Frequency')
plt.xticks(rotation=45)
plt.show()

71.6% of the days are Weekday and 28.4% of the days are Weekends

Box Plots by Day Type - Distribution of Sales

In [ ]:
# Define a color palette for day types
day_type_colors = {'WEEKDAY': 'ORANGE', 'WEEKEND': 'RED'}

# Box plots of sales by day type with different colors
plt.figure(figsize=(10, 6))
sns.boxplot(x='calendar_information.type_of_day', y='daily_yoy_ndt.total_inside_sales', data=df_tsdm, palette=day_type_colors)
plt.title('Sales Distribution by Day Type')
plt.xlabel('Day Type')
plt.ylabel('Inside Sales')
plt.xticks(rotation=45)
plt.show()

4. EDA- Quantitative Data¶

4.1 Set up, import data, and prepare Quantitative data¶

In [ ]:
df_qdm = pd.read_csv("qualitative_data_msba.csv")
In [ ]:
num_columns = df_qdm.shape[1]
print("Number of columns in df_qdm:", num_columns)
Number of columns in df_qdm: 55
In [ ]:
df_qdm.describe()
Out[ ]:
Unnamed: 0 open_year square_feet front_door_count years_since_last_project parking_spaces x1_mile_pop x1_mile_emp x1_mile_income x1_2_mile_pop ... traditional_forecourt_fueling_positions rv_lanes_fueling_positions hi_flow_lanes_fueling_positions hi_flow_lanes_fueling_positions_2 rv_lanes_fueling_positions_2 mens_toilet_count mens_urinal_count womens_toilet_count womens_sink_count site_id_msba
count 37.000000 37.000000 37.00000 37.0 37.000000 37.000000 37.000000 37.000000 37.000000 37.000000 ... 37.000000 37.000000 37.000000 37.000000 37.000000 37.000000 37.000000 37.000000 37.000000 37.000000
mean 19.000000 2021.324324 4970.27027 2.0 1.648649 37.405405 6703.567568 4757.648649 53300.378378 1833.108108 ... 14.270270 2.513514 3.324324 3.324324 2.513514 2.378378 2.351351 4.648649 1.702703 23040.405405
std 10.824355 0.474579 575.93121 0.0 0.483978 5.918237 5694.011350 4697.168291 24333.027254 1915.140476 ... 3.948619 2.049683 2.925501 2.925501 2.049683 0.923500 0.856875 1.751447 0.740303 730.069801
min 1.000000 2021.000000 2933.00000 2.0 1.000000 23.000000 0.000000 56.000000 0.000000 0.000000 ... 10.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 21560.000000
25% 10.000000 2021.000000 5046.00000 2.0 1.000000 34.000000 1984.000000 1771.000000 39538.000000 262.000000 ... 12.000000 0.000000 0.000000 0.000000 0.000000 2.000000 2.000000 4.000000 1.000000 22540.000000
50% 19.000000 2021.000000 5046.00000 2.0 2.000000 38.000000 5574.000000 3895.000000 46356.000000 1003.000000 ... 12.000000 4.000000 5.000000 5.000000 4.000000 2.000000 2.000000 4.000000 2.000000 22890.000000
75% 28.000000 2022.000000 5046.00000 2.0 2.000000 41.000000 11269.000000 6002.000000 73519.000000 2686.000000 ... 16.000000 4.000000 5.000000 5.000000 4.000000 3.000000 3.000000 6.000000 2.000000 23555.000000
max 37.000000 2022.000000 6134.00000 2.0 2.000000 49.000000 18692.000000 26077.000000 110957.000000 5923.000000 ... 24.000000 6.000000 9.000000 9.000000 6.000000 5.000000 5.000000 10.000000 4.000000 24535.000000

8 rows × 28 columns

4.2 Columns with Missing values¶

In [ ]:
# Check for null values in the Qualitative DataFrame
null_values_qualitative = df_qdm.isnull().sum()

# Display the columns with null values and their counts
print("Null Values in Qualitative DataFrame:")
print(null_values_qualitative[null_values_qualitative > 0])
Null Values in Qualitative DataFrame:
rv_lanes_layout                14
rv_lanes_stack_type            14
hi_flow_lanes_layout           15
hi_flow_lanes_stack_type       15
hi_flow_rv_lanes_layout        14
hi_flow_rv_lanes_stack_type    14
dtype: int64
In [ ]:
#Check for Unique values in the columns with Missing values
print('rv_lanes_layout Unique values: ', df_qdm['rv_lanes_layout'].unique())
print('rv_lanes_stack_type Unique values: ', df_qdm['rv_lanes_stack_type'].unique())
print('hi_flow_lanes_layout Unique values: ', df_qdm['hi_flow_lanes_layout'].unique())
print('hi_flow_lanes_stack_type Unique values: ', df_qdm['hi_flow_lanes_stack_type'].unique())
print('hi_flow_rv_lanes_layout Unique values: ', df_qdm['hi_flow_rv_lanes_layout'].unique())
print('hi_flow_rv_lanes_stack_type Unique values: ', df_qdm['hi_flow_rv_lanes_stack_type'].unique())
rv_lanes_layout Unique values:  ['Stack' 'In-Line' nan]
rv_lanes_stack_type Unique values:  ['HF/RV' 'None' nan]
hi_flow_lanes_layout Unique values:  ['Stack' 'Combo' nan]
hi_flow_lanes_stack_type Unique values:  ['HF/RV' nan]
hi_flow_rv_lanes_layout Unique values:  ['Stack' 'Combo' 'In-Line' nan]
hi_flow_rv_lanes_stack_type Unique values:  ['HF/RV' 'None' nan]

Since the missing values are mostly categorical and can be interpreted as not present. We are imputing a new 'Not Present' Category in place of null values. Since the 'rv_lanes_stack_type' and 'hi_flow_rv_lanes_stack_type' already have the 'None' Category, we will impute the missing values in this columns with 'None' instead of 'Not Present'.

4.3 Imputing Missing Values¶

In [ ]:
#Imputing Missing values with 'None' and 'Not Present' categories
df_qdm['rv_lanes_stack_type'] = df_qdm['rv_lanes_stack_type'].fillna('None')
df_qdm['hi_flow_rv_lanes_stack_type'] = df_qdm['hi_flow_rv_lanes_stack_type'].fillna('None')
df_qdm = df_qdm.fillna('Not Present')
In [ ]:
df_qdm.isnull().sum()
Out[ ]:
Unnamed: 0                                 0
open_year                                  0
square_feet                                0
front_door_count                           0
years_since_last_project                   0
parking_spaces                             0
lottery                                    0
freal                                      0
bonfire_grill                              0
pizza                                      0
cinnabon                                   0
godfather_s_pizza                          0
ethanol_free                               0
diesel                                     0
hi_flow_lanes                              0
rv_lanes                                   0
hi_flow_rv_lanes                           0
def                                        0
cat_scales                                 0
car_wash                                   0
ev_charging                                0
rv_dumps                                   0
propane                                    0
x1_mile_pop                                0
x1_mile_emp                                0
x1_mile_income                             0
x1_2_mile_pop                              0
x1_2_mile_emp                              0
x1_2_mile_income                           0
x5_min_pop                                 0
x5_min_emp                                 0
x5_min_inc                                 0
x7_min_pop                                 0
x7_min_emp                                 0
x7_min_inc                                 0
traditional_forecourt_fueling_positions    0
traditional_forecourt_layout               0
traditional_forecourt_stack_type           0
rv_lanes_fueling_positions                 0
rv_lanes_layout                            0
rv_lanes_stack_type                        0
hi_flow_lanes_fueling_positions            0
hi_flow_lanes_layout                       0
hi_flow_lanes_stack_type                   0
hi_flow_lanes_fueling_positions_2          0
rv_lanes_fueling_positions_2               0
hi_flow_rv_lanes_layout                    0
hi_flow_rv_lanes_stack_type                0
non_24_hour                                0
self_check_out                             0
mens_toilet_count                          0
mens_urinal_count                          0
womens_toilet_count                        0
womens_sink_count                          0
site_id_msba                               0
dtype: int64

4.4 Finding variability of columns¶

In [ ]:
unique_counts = df_qdm.nunique()

# Sort the unique counts in descending order
unique_counts = unique_counts.sort_values(ascending=False)

# Print the number of unique values for each column in descending order
print(unique_counts)
Unnamed: 0                                 37
x1_mile_pop                                37
x7_min_inc                                 37
x7_min_emp                                 37
x7_min_pop                                 37
x5_min_emp                                 37
x5_min_pop                                 37
x1_mile_emp                                37
x1_2_mile_emp                              37
site_id_msba                               37
x5_min_inc                                 36
x1_mile_income                             36
x1_2_mile_pop                              35
x1_2_mile_income                           33
parking_spaces                             21
square_feet                                 9
hi_flow_lanes_fueling_positions_2           6
hi_flow_lanes_fueling_positions             6
mens_toilet_count                           6
womens_toilet_count                         6
traditional_forecourt_fueling_positions     6
rv_lanes_fueling_positions_2                5
rv_lanes_fueling_positions                  5
mens_urinal_count                           5
womens_sink_count                           5
hi_flow_rv_lanes_layout                     4
hi_flow_lanes_layout                        3
rv_lanes_layout                             3
traditional_forecourt_stack_type            3
hi_flow_lanes_stack_type                    2
rv_lanes_stack_type                         2
hi_flow_rv_lanes_stack_type                 2
traditional_forecourt_layout                2
pizza                                       2
cat_scales                                  2
cinnabon                                    2
ethanol_free                                2
hi_flow_lanes                               2
rv_lanes                                    2
hi_flow_rv_lanes                            2
def                                         2
years_since_last_project                    2
rv_dumps                                    2
propane                                     2
bonfire_grill                               2
freal                                       2
open_year                                   2
lottery                                     2
ev_charging                                 1
car_wash                                    1
non_24_hour                                 1
self_check_out                              1
diesel                                      1
godfather_s_pizza                           1
front_door_count                            1
dtype: int64

4.5 Detecting Outliers¶

In [ ]:
from scipy import stats


# Select the columns to check for outliers (e.g., numerical columns)
numerical_columns = df_qdm.select_dtypes(include=['int64', 'float64'])

# Define a threshold for outliers
threshold = 3

# Create a DataFrame to store the z-scores
z_scores_df = (numerical_columns - numerical_columns.mean()) / numerical_columns.std()

# Calculate the number of outliers in each column
outliers_count = (z_scores_df.abs() > threshold).sum()

# Display the number of outliers in each column
print("Number of outliers in each column:")
print(outliers_count)
Number of outliers in each column:
Unnamed: 0                                 0
open_year                                  0
square_feet                                2
front_door_count                           0
years_since_last_project                   0
parking_spaces                             0
x1_mile_pop                                0
x1_mile_emp                                1
x1_mile_income                             0
x1_2_mile_pop                              0
x1_2_mile_emp                              1
x1_2_mile_income                           0
x5_min_pop                                 1
x5_min_emp                                 0
x5_min_inc                                 0
x7_min_pop                                 1
x7_min_emp                                 2
x7_min_inc                                 0
traditional_forecourt_fueling_positions    0
rv_lanes_fueling_positions                 0
hi_flow_lanes_fueling_positions            0
hi_flow_lanes_fueling_positions_2          0
rv_lanes_fueling_positions_2               0
mens_toilet_count                          0
mens_urinal_count                          1
womens_toilet_count                        1
womens_sink_count                          1
site_id_msba                               0
dtype: int64
In [ ]:
unique_site_id_count = df_tsdm['site_id_msba'].nunique()

# Print the count of unique site IDs
print("Count of unique site IDs:", unique_site_id_count)


# Get the unique site IDs
unique_site_ids = df_tsdm['site_id_msba'].unique()

# Print the unique site IDs
print(unique_site_ids)
Count of unique site IDs: 38
[24535 24255 24220 24150 23905 23835 23765 23730 23660 23555 23485 23450
 23415 23380 23345 23240 23135 23065 22925 22890 22855 22820 22785 22750
 22715 22680 22645 22575 22540 22505 22400 22330 22260 22120 22085 22015
 21980 21560]
In [ ]:
df_qdm.describe()
Out[ ]:
Unnamed: 0 open_year square_feet front_door_count years_since_last_project parking_spaces x1_mile_pop x1_mile_emp x1_mile_income x1_2_mile_pop ... traditional_forecourt_fueling_positions rv_lanes_fueling_positions hi_flow_lanes_fueling_positions hi_flow_lanes_fueling_positions_2 rv_lanes_fueling_positions_2 mens_toilet_count mens_urinal_count womens_toilet_count womens_sink_count site_id_msba
count 37.000000 37.000000 37.00000 37.0 37.000000 37.000000 37.000000 37.000000 37.000000 37.000000 ... 37.000000 37.000000 37.000000 37.000000 37.000000 37.000000 37.000000 37.000000 37.000000 37.000000
mean 19.000000 2021.324324 4970.27027 2.0 1.648649 37.405405 6703.567568 4757.648649 53300.378378 1833.108108 ... 14.270270 2.513514 3.324324 3.324324 2.513514 2.378378 2.351351 4.648649 1.702703 23040.405405
std 10.824355 0.474579 575.93121 0.0 0.483978 5.918237 5694.011350 4697.168291 24333.027254 1915.140476 ... 3.948619 2.049683 2.925501 2.925501 2.049683 0.923500 0.856875 1.751447 0.740303 730.069801
min 1.000000 2021.000000 2933.00000 2.0 1.000000 23.000000 0.000000 56.000000 0.000000 0.000000 ... 10.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 21560.000000
25% 10.000000 2021.000000 5046.00000 2.0 1.000000 34.000000 1984.000000 1771.000000 39538.000000 262.000000 ... 12.000000 0.000000 0.000000 0.000000 0.000000 2.000000 2.000000 4.000000 1.000000 22540.000000
50% 19.000000 2021.000000 5046.00000 2.0 2.000000 38.000000 5574.000000 3895.000000 46356.000000 1003.000000 ... 12.000000 4.000000 5.000000 5.000000 4.000000 2.000000 2.000000 4.000000 2.000000 22890.000000
75% 28.000000 2022.000000 5046.00000 2.0 2.000000 41.000000 11269.000000 6002.000000 73519.000000 2686.000000 ... 16.000000 4.000000 5.000000 5.000000 4.000000 3.000000 3.000000 6.000000 2.000000 23555.000000
max 37.000000 2022.000000 6134.00000 2.0 2.000000 49.000000 18692.000000 26077.000000 110957.000000 5923.000000 ... 24.000000 6.000000 9.000000 9.000000 6.000000 5.000000 5.000000 10.000000 4.000000 24535.000000

8 rows × 28 columns

In [ ]:
df_tsdm.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13908 entries, 0 to 13907
Data columns (total 25 columns):
 #   Column                              Non-Null Count  Dtype         
---  ------                              --------------  -----         
 0   Unnamed: 0                          13908 non-null  int64         
 1   capital_projects.soft_opening_date  13908 non-null  object        
 2   calendar.calendar_day_date          13908 non-null  datetime64[ns]
 3   calendar.fiscal_week_id_for_year    13908 non-null  int64         
 4   calendar.day_of_week                13908 non-null  int64         
 5   calendar_information.holiday        13908 non-null  object        
 6   calendar_information.type_of_day    13908 non-null  object        
 7   daily_yoy_ndt.total_inside_sales    13908 non-null  float64       
 8   daily_yoy_ndt.total_food_service    13908 non-null  float64       
 9   diesel                              13908 non-null  float64       
 10  unleaded                            13908 non-null  float64       
 11  site_id_msba                        13908 non-null  int64         
 12  total_daily_sales                   13908 non-null  float64       
 13  total_daily_fuel_sales              13908 non-null  float64       
 14  Date                                13908 non-null  datetime64[ns]
 15  Year                                13908 non-null  int64         
 16  Month                               13908 non-null  object        
 17  Fiscal_Year                         13908 non-null  int64         
 18  Fiscal_Week                         13908 non-null  object        
 19  year                                13908 non-null  int64         
 20  month                               13908 non-null  int64         
 21  day                                 13908 non-null  int64         
 22  quarter                             13908 non-null  int64         
 23  day_of_week                         13908 non-null  int64         
 24  is_weekend                          13908 non-null  int64         
dtypes: datetime64[ns](2), float64(6), int64(12), object(5)
memory usage: 2.7+ MB

4.6 Separate variables into categorical and continuous¶

In [ ]:
# Separate variables into categorical and continuous
categorical_vars = ['lottery', 'freal', 'bonfire_grill', 'pizza', 'cinnabon', 'ethanol_free', 'hi_flow_lanes', 'rv_lanes', 'cat_scales', 'rv_dumps', 'propane', 'traditional_forecourt_layout', 'traditional_forecourt_stack_type', 'rv_lanes_layout', 'rv_lanes_stack_type', 'hi_flow_lanes_layout', 'hi_flow_rv_lanes_layout']  # List to store categorical variable names
continuous_vars = ['open_year', 'years_since_last_project', 'lottery','square_feet', 'parking_spaces', 'x1_mile_pop', 'x1_mile_emp', 'x1_mile_income', 'x1_2_mile_pop', 'x1_2_mile_emp', 'x1_2_mile_income', 'x5_min_pop', 'x5_min_emp', 'x5_min_inc', 'x7_min_pop', 'x7_min_emp', 'x7_min_inc', 'traditional_forecourt_fueling_positions', 'rv_lanes_fueling_positions', 'hi_flow_lanes_fueling_positions', 'mens_toilet_count', 'mens_urinal_count', 'womens_toilet_count', 'womens_sink_count', 'site_id_msba']   # List to store continuous variable names


# Print the lists of categorical and continuous variables
print("Categorical Variables:")
print(categorical_vars)

print("\nContinuous Variables:")
print(continuous_vars)
Categorical Variables:
['lottery', 'freal', 'bonfire_grill', 'pizza', 'cinnabon', 'ethanol_free', 'hi_flow_lanes', 'rv_lanes', 'cat_scales', 'rv_dumps', 'propane', 'traditional_forecourt_layout', 'traditional_forecourt_stack_type', 'rv_lanes_layout', 'rv_lanes_stack_type', 'hi_flow_lanes_layout', 'hi_flow_rv_lanes_layout']

Continuous Variables:
['open_year', 'years_since_last_project', 'lottery', 'square_feet', 'parking_spaces', 'x1_mile_pop', 'x1_mile_emp', 'x1_mile_income', 'x1_2_mile_pop', 'x1_2_mile_emp', 'x1_2_mile_income', 'x5_min_pop', 'x5_min_emp', 'x5_min_inc', 'x7_min_pop', 'x7_min_emp', 'x7_min_inc', 'traditional_forecourt_fueling_positions', 'rv_lanes_fueling_positions', 'hi_flow_lanes_fueling_positions', 'mens_toilet_count', 'mens_urinal_count', 'womens_toilet_count', 'womens_sink_count', 'site_id_msba']
In [ ]:
df_qdm['open_year'].value_counts()
Out[ ]:
2021    25
2022    12
Name: open_year, dtype: int64
In [ ]:
num_columns = df_qdm.shape[1]
print("Number of columns in df_qdm:", num_columns)
Number of columns in df_qdm: 55
In [ ]:
df_qdm.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 37 entries, 0 to 36
Data columns (total 55 columns):
 #   Column                                   Non-Null Count  Dtype 
---  ------                                   --------------  ----- 
 0   Unnamed: 0                               37 non-null     int64 
 1   open_year                                37 non-null     int64 
 2   square_feet                              37 non-null     int64 
 3   front_door_count                         37 non-null     int64 
 4   years_since_last_project                 37 non-null     int64 
 5   parking_spaces                           37 non-null     int64 
 6   lottery                                  37 non-null     object
 7   freal                                    37 non-null     object
 8   bonfire_grill                            37 non-null     object
 9   pizza                                    37 non-null     object
 10  cinnabon                                 37 non-null     object
 11  godfather_s_pizza                        37 non-null     object
 12  ethanol_free                             37 non-null     object
 13  diesel                                   37 non-null     object
 14  hi_flow_lanes                            37 non-null     object
 15  rv_lanes                                 37 non-null     object
 16  hi_flow_rv_lanes                         37 non-null     object
 17  def                                      37 non-null     object
 18  cat_scales                               37 non-null     object
 19  car_wash                                 37 non-null     object
 20  ev_charging                              37 non-null     object
 21  rv_dumps                                 37 non-null     object
 22  propane                                  37 non-null     object
 23  x1_mile_pop                              37 non-null     int64 
 24  x1_mile_emp                              37 non-null     int64 
 25  x1_mile_income                           37 non-null     int64 
 26  x1_2_mile_pop                            37 non-null     int64 
 27  x1_2_mile_emp                            37 non-null     int64 
 28  x1_2_mile_income                         37 non-null     int64 
 29  x5_min_pop                               37 non-null     int64 
 30  x5_min_emp                               37 non-null     int64 
 31  x5_min_inc                               37 non-null     int64 
 32  x7_min_pop                               37 non-null     int64 
 33  x7_min_emp                               37 non-null     int64 
 34  x7_min_inc                               37 non-null     int64 
 35  traditional_forecourt_fueling_positions  37 non-null     int64 
 36  traditional_forecourt_layout             37 non-null     object
 37  traditional_forecourt_stack_type         37 non-null     object
 38  rv_lanes_fueling_positions               37 non-null     int64 
 39  rv_lanes_layout                          37 non-null     object
 40  rv_lanes_stack_type                      37 non-null     object
 41  hi_flow_lanes_fueling_positions          37 non-null     int64 
 42  hi_flow_lanes_layout                     37 non-null     object
 43  hi_flow_lanes_stack_type                 37 non-null     object
 44  hi_flow_lanes_fueling_positions_2        37 non-null     int64 
 45  rv_lanes_fueling_positions_2             37 non-null     int64 
 46  hi_flow_rv_lanes_layout                  37 non-null     object
 47  hi_flow_rv_lanes_stack_type              37 non-null     object
 48  non_24_hour                              37 non-null     object
 49  self_check_out                           37 non-null     object
 50  mens_toilet_count                        37 non-null     int64 
 51  mens_urinal_count                        37 non-null     int64 
 52  womens_toilet_count                      37 non-null     int64 
 53  womens_sink_count                        37 non-null     int64 
 54  site_id_msba                             37 non-null     int64 
dtypes: int64(28), object(27)
memory usage: 16.0+ KB
In [ ]:
df_qdm.describe()
Out[ ]:
Unnamed: 0 open_year square_feet front_door_count years_since_last_project parking_spaces x1_mile_pop x1_mile_emp x1_mile_income x1_2_mile_pop ... traditional_forecourt_fueling_positions rv_lanes_fueling_positions hi_flow_lanes_fueling_positions hi_flow_lanes_fueling_positions_2 rv_lanes_fueling_positions_2 mens_toilet_count mens_urinal_count womens_toilet_count womens_sink_count site_id_msba
count 37.000000 37.000000 37.00000 37.0 37.000000 37.000000 37.000000 37.000000 37.000000 37.000000 ... 37.000000 37.000000 37.000000 37.000000 37.000000 37.000000 37.000000 37.000000 37.000000 37.000000
mean 19.000000 2021.324324 4970.27027 2.0 1.648649 37.405405 6703.567568 4757.648649 53300.378378 1833.108108 ... 14.270270 2.513514 3.324324 3.324324 2.513514 2.378378 2.351351 4.648649 1.702703 23040.405405
std 10.824355 0.474579 575.93121 0.0 0.483978 5.918237 5694.011350 4697.168291 24333.027254 1915.140476 ... 3.948619 2.049683 2.925501 2.925501 2.049683 0.923500 0.856875 1.751447 0.740303 730.069801
min 1.000000 2021.000000 2933.00000 2.0 1.000000 23.000000 0.000000 56.000000 0.000000 0.000000 ... 10.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 21560.000000
25% 10.000000 2021.000000 5046.00000 2.0 1.000000 34.000000 1984.000000 1771.000000 39538.000000 262.000000 ... 12.000000 0.000000 0.000000 0.000000 0.000000 2.000000 2.000000 4.000000 1.000000 22540.000000
50% 19.000000 2021.000000 5046.00000 2.0 2.000000 38.000000 5574.000000 3895.000000 46356.000000 1003.000000 ... 12.000000 4.000000 5.000000 5.000000 4.000000 2.000000 2.000000 4.000000 2.000000 22890.000000
75% 28.000000 2022.000000 5046.00000 2.0 2.000000 41.000000 11269.000000 6002.000000 73519.000000 2686.000000 ... 16.000000 4.000000 5.000000 5.000000 4.000000 3.000000 3.000000 6.000000 2.000000 23555.000000
max 37.000000 2022.000000 6134.00000 2.0 2.000000 49.000000 18692.000000 26077.000000 110957.000000 5923.000000 ... 24.000000 6.000000 9.000000 9.000000 6.000000 5.000000 5.000000 10.000000 4.000000 24535.000000

8 rows × 28 columns

4.7 Visualizing Qualitative Features¶

In [ ]:
# Group the data by 'site_id' and find the maximum and minimum square footage
max_square_footage = df_qdm.groupby('site_id_msba')['square_feet'].max()
min_square_footage = df_qdm.groupby('site_id_msba')['square_feet'].min()

# Find the site_id with the highest and lowest square footage
site_id_with_max_square_footage = max_square_footage.idxmax()
site_id_with_min_square_footage = min_square_footage.idxmin()

# Get the actual values of square footage
highest_square_footage = max_square_footage.max()
lowest_square_footage = min_square_footage.min()

# Print the results
print(f"Site ID with the highest square footage: {site_id_with_max_square_footage}")
print(f"Highest square footage: {highest_square_footage} square feet")

print(f"Site ID with the lowest square footage: {site_id_with_min_square_footage}")
print(f"Lowest square footage: {lowest_square_footage} square feet")
Site ID with the highest square footage: 22890
Highest square footage: 6134 square feet
Site ID with the lowest square footage: 24150
Lowest square footage: 2933 square feet
In [ ]:
# Sort the DataFrame by 'square_feet' in ascending order
df_qdm_sorted = df_qdm.sort_values(by='square_feet')

# Create a color palette with a different color for each site_id
color_palette = plt.cm.get_cmap('tab20', len(df_qdm_sorted))

# Create a bar chart of square footage in ascending order with different colors
plt.figure(figsize=(12, 6))  # Adjust the figure size as needed

# Use 'site_id_msba' as the x-axis and 'square_feet' as the y-axis
# Specify the color for each bar based on the color_palette
df_qdm_sorted.plot(kind='bar', x='site_id_msba', y='square_feet', color=[color_palette(i) for i in range(len(df_qdm_sorted))])

plt.xlabel('Site ID')
plt.ylabel('Square Footage')
plt.title('Square Footage by Site ID (Ascending Order)')
plt.xticks(rotation=90)  # Rotate x-axis labels for better visibility
plt.tight_layout()

# Show the bar chart
plt.show()
<ipython-input-56-bb2319f1d611>:5: MatplotlibDeprecationWarning:

The get_cmap function was deprecated in Matplotlib 3.7 and will be removed two minor releases later. Use ``matplotlib.colormaps[name]`` or ``matplotlib.colormaps.get_cmap(obj)`` instead.

<Figure size 1200x600 with 0 Axes>
In [ ]:
# Group the data by 'site_id_msba' and calculate the total parking spaces for each site
site_parking_spaces = df_qdm.groupby('site_id_msba')['parking_spaces'].sum()

# Create a figure with a size that accommodates all site IDs
plt.figure(figsize=(16, 6))  # Adjust the figure size as needed

# Create an array of x-values for each site ID
x_values = np.arange(len(site_parking_spaces))

# Generate different colors for the bubbles using the 'tab20' colormap
colors = plt.cm.tab20(np.linspace(0, 1, len(site_parking_spaces)))

# Scatter plot with site IDs, parking spaces, and customized bubble sizes and colors
plt.scatter(x_values,  # X-values for site IDs
            site_parking_spaces.values,  # Bubble sizes based on parking spaces
            s=site_parking_spaces.values * 10,  # Customize bubble sizes (multiplied by 3 for larger size)
            c=colors,  # Use the generated colors
            alpha=0.7,  # Transparency
            label='Parking Spaces')

# Set x-axis labels to be the site IDs
plt.xticks(x_values, site_parking_spaces.index, rotation=90)  # Rotate x-axis labels for readability

# Add labels to the bubbles (optional)
for x, spaces in zip(x_values, site_parking_spaces.values):
    plt.annotate(spaces,
                 (x, spaces),  # Coordinates of the bubble
                 textcoords="offset points",
                 xytext=(0, 15),  # Adjust the distance from the bubble
                 ha='center')

# Customize plot labels and title
plt.xlabel('Site ID')
plt.ylabel('Parking Spaces')
plt.title('Bubble Chart: Parking Spaces vs. Site ID')

# Show legend
plt.legend()

# Display the bubble chart with distinct colors
plt.tight_layout()
plt.show()
In [ ]:
# Define a list of qualitative features to be plotted
qualitative_features = ['square_feet', 'parking_spaces']

# Create a subplot grid of 1 row and 2 columns, with a specific size
fig, axes = plt.subplots(1, 2, figsize=(18, 6))

# Set the overall title for the figure
fig.suptitle('Distribution of Store square feet and parking spaces')

# Loop through each feature in the 'qualitative_features' list
for i, col in enumerate(qualitative_features):
    # Create a histogram plot for each feature using Seaborn's 'histplot'
    # It uses 20 bins and also plots a Kernel Density Estimation (kde)
    sns.histplot(df_qdm[col], bins=20, kde=True, ax=axes[i])

    # Set the title for each subplot
    axes[i].set_title(f'Distribution of {col}')

    # Set the x-axis label for each subplot
    axes[i].set_xlabel(col)

    # Set the y-axis label for each subplot
    axes[i].set_ylabel('Frequency')

# Adjust layout to prevent overlap and set the rectangle dimensions for tight_layout
plt.tight_layout(rect=[0, 0, 1, 0.96])

plt.show()

Square Feet: The distribution of the store size appears to be right-skewed, indicating that most stores are smaller in size.

Parking Spaces: This variable also shows a slight right-skewed distribution, suggesting that most stores have fewer parking spaces.

In [ ]:
# Count the number of 'Yes' and 'No' entries in the 'lottery' column
lottery_counts = df_qdm['lottery'].value_counts()

# Create a bar chart
plt.figure(figsize=(8, 8))
ax = lottery_counts.plot(kind='bar', color=['skyblue', 'lightcoral'])

# Customize plot labels and title
plt.xlabel('Lottery Offered')
plt.ylabel('Count')
plt.title('Number of Sites Offering Lottery Tickets')

# Annotate the values on top of the bars
for i, v in enumerate(lottery_counts):
    ax.text(i, v + 0.2, str(v), ha='center', va='bottom', fontsize=12)

# Show the plot
plt.tight_layout()
plt.show()
In [ ]:
print("Categorical Variables:")
print(categorical_vars)

print("\nContinuous Variables:")
print(continuous_vars)
Categorical Variables:
['lottery', 'freal', 'bonfire_grill', 'pizza', 'cinnabon', 'ethanol_free', 'hi_flow_lanes', 'rv_lanes', 'cat_scales', 'rv_dumps', 'propane', 'traditional_forecourt_layout', 'traditional_forecourt_stack_type', 'rv_lanes_layout', 'rv_lanes_stack_type', 'hi_flow_lanes_layout', 'hi_flow_rv_lanes_layout']

Continuous Variables:
['open_year', 'years_since_last_project', 'lottery', 'square_feet', 'parking_spaces', 'x1_mile_pop', 'x1_mile_emp', 'x1_mile_income', 'x1_2_mile_pop', 'x1_2_mile_emp', 'x1_2_mile_income', 'x5_min_pop', 'x5_min_emp', 'x5_min_inc', 'x7_min_pop', 'x7_min_emp', 'x7_min_inc', 'traditional_forecourt_fueling_positions', 'rv_lanes_fueling_positions', 'hi_flow_lanes_fueling_positions', 'mens_toilet_count', 'mens_urinal_count', 'womens_toilet_count', 'womens_sink_count', 'site_id_msba']
In [ ]:
# Create pie charts for each categorical column
for column in categorical_vars:
    if column in df_qdm.columns:
        # Count the frequency of each category in the column
        category_counts = df_qdm[column].value_counts()

        # Plot a pie chart
        plt.figure(figsize=(6, 6))
        plt.pie(category_counts, labels=category_counts.index, autopct='%1.1f%%', startangle=140)
        plt.axis('equal')  # Equal aspect ratio ensures that pie is drawn as a circle.
        plt.title(f'Pie Chart for {column}')

        # Show the pie chart
        plt.show()
In [ ]:
df_qdm.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 37 entries, 0 to 36
Data columns (total 55 columns):
 #   Column                                   Non-Null Count  Dtype 
---  ------                                   --------------  ----- 
 0   Unnamed: 0                               37 non-null     int64 
 1   open_year                                37 non-null     int64 
 2   square_feet                              37 non-null     int64 
 3   front_door_count                         37 non-null     int64 
 4   years_since_last_project                 37 non-null     int64 
 5   parking_spaces                           37 non-null     int64 
 6   lottery                                  37 non-null     object
 7   freal                                    37 non-null     object
 8   bonfire_grill                            37 non-null     object
 9   pizza                                    37 non-null     object
 10  cinnabon                                 37 non-null     object
 11  godfather_s_pizza                        37 non-null     object
 12  ethanol_free                             37 non-null     object
 13  diesel                                   37 non-null     object
 14  hi_flow_lanes                            37 non-null     object
 15  rv_lanes                                 37 non-null     object
 16  hi_flow_rv_lanes                         37 non-null     object
 17  def                                      37 non-null     object
 18  cat_scales                               37 non-null     object
 19  car_wash                                 37 non-null     object
 20  ev_charging                              37 non-null     object
 21  rv_dumps                                 37 non-null     object
 22  propane                                  37 non-null     object
 23  x1_mile_pop                              37 non-null     int64 
 24  x1_mile_emp                              37 non-null     int64 
 25  x1_mile_income                           37 non-null     int64 
 26  x1_2_mile_pop                            37 non-null     int64 
 27  x1_2_mile_emp                            37 non-null     int64 
 28  x1_2_mile_income                         37 non-null     int64 
 29  x5_min_pop                               37 non-null     int64 
 30  x5_min_emp                               37 non-null     int64 
 31  x5_min_inc                               37 non-null     int64 
 32  x7_min_pop                               37 non-null     int64 
 33  x7_min_emp                               37 non-null     int64 
 34  x7_min_inc                               37 non-null     int64 
 35  traditional_forecourt_fueling_positions  37 non-null     int64 
 36  traditional_forecourt_layout             37 non-null     object
 37  traditional_forecourt_stack_type         37 non-null     object
 38  rv_lanes_fueling_positions               37 non-null     int64 
 39  rv_lanes_layout                          37 non-null     object
 40  rv_lanes_stack_type                      37 non-null     object
 41  hi_flow_lanes_fueling_positions          37 non-null     int64 
 42  hi_flow_lanes_layout                     37 non-null     object
 43  hi_flow_lanes_stack_type                 37 non-null     object
 44  hi_flow_lanes_fueling_positions_2        37 non-null     int64 
 45  rv_lanes_fueling_positions_2             37 non-null     int64 
 46  hi_flow_rv_lanes_layout                  37 non-null     object
 47  hi_flow_rv_lanes_stack_type              37 non-null     object
 48  non_24_hour                              37 non-null     object
 49  self_check_out                           37 non-null     object
 50  mens_toilet_count                        37 non-null     int64 
 51  mens_urinal_count                        37 non-null     int64 
 52  womens_toilet_count                      37 non-null     int64 
 53  womens_sink_count                        37 non-null     int64 
 54  site_id_msba                             37 non-null     int64 
dtypes: int64(28), object(27)
memory usage: 16.0+ KB
In [ ]:
# List of population variables
population_variables = ['x1_mile_pop', 'x1_2_mile_pop', 'x5_min_pop', 'x7_min_pop']

# Loop through each population variable and create a bar plot
for variable in population_variables:
    # Group the data by site_id and calculate the mean for the population variable
    data_to_plot = df_qdm.groupby('site_id_msba')[variable].mean().reset_index()

    # Sort the data in descending order by the population variable
    data_to_plot = data_to_plot.sort_values(by=variable, ascending=False)

    # Create a bar plot
    plt.figure(figsize=(12, 6))
    sns.barplot(x='site_id_msba', y=variable, data=data_to_plot, palette='viridis', order=data_to_plot['site_id_msba'])
    plt.title(f'Bar Plot: {variable} by site_id (Descending Order)')
    plt.xlabel('site_id')
    plt.ylabel(variable)
    plt.xticks(rotation=90)
    plt.grid(axis='y', linestyle='--', alpha=0.7)

    # Show the plot
    plt.show()
In [ ]:
# List of employment variables
employment_variables = ['x1_mile_emp', 'x1_2_mile_emp', 'x5_min_emp', 'x7_min_emp']

# Loop through each employment variable and create a bar plot
for variable in employment_variables:
    # Group the data by site_id and calculate the mean for the employment variable
    data_to_plot = df_qdm.groupby('site_id_msba')[variable].mean().reset_index()

    # Sort the data in descending order by the employment variable
    data_to_plot = data_to_plot.sort_values(by=variable, ascending=False)

    # Create a bar plot
    plt.figure(figsize=(12, 6))
    sns.barplot(x='site_id_msba', y=variable, data=data_to_plot, palette='viridis', order=data_to_plot['site_id_msba'])
    plt.title(f'Bar Plot: {variable} by site_id (Descending Order)')
    plt.xlabel('site_id')
    plt.ylabel(variable)
    plt.xticks(rotation=90)
    plt.grid(axis='y', linestyle='--', alpha=0.7)

    # Show the plot
    plt.show()
In [ ]:
# List of median income variables
income_variables = ['x1_mile_income', 'x1_2_mile_income', 'x5_min_inc', 'x7_min_inc']

# Loop through each income variable and create a bar plot
for variable in income_variables:
    # Group the data by site_id and calculate the mean for the income variable
    data_to_plot = df_qdm.groupby('site_id_msba')[variable].mean().reset_index()

    # Sort the data in descending order by the income variable
    data_to_plot = data_to_plot.sort_values(by=variable, ascending=False)

    # Create a bar plot
    plt.figure(figsize=(12, 6))
    sns.barplot(x='site_id_msba', y=variable, data=data_to_plot, palette='viridis', order=data_to_plot['site_id_msba'])
    plt.title(f'Bar Plot: {variable} by site_id (Descending Order)')
    plt.xlabel('site_id')
    plt.ylabel(variable)
    plt.xticks(rotation=90)
    plt.grid(axis='y', linestyle='--', alpha=0.7)

    # Show the plot
    plt.show()
In [ ]:
df_qdm.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 37 entries, 0 to 36
Data columns (total 55 columns):
 #   Column                                   Non-Null Count  Dtype 
---  ------                                   --------------  ----- 
 0   Unnamed: 0                               37 non-null     int64 
 1   open_year                                37 non-null     int64 
 2   square_feet                              37 non-null     int64 
 3   front_door_count                         37 non-null     int64 
 4   years_since_last_project                 37 non-null     int64 
 5   parking_spaces                           37 non-null     int64 
 6   lottery                                  37 non-null     object
 7   freal                                    37 non-null     object
 8   bonfire_grill                            37 non-null     object
 9   pizza                                    37 non-null     object
 10  cinnabon                                 37 non-null     object
 11  godfather_s_pizza                        37 non-null     object
 12  ethanol_free                             37 non-null     object
 13  diesel                                   37 non-null     object
 14  hi_flow_lanes                            37 non-null     object
 15  rv_lanes                                 37 non-null     object
 16  hi_flow_rv_lanes                         37 non-null     object
 17  def                                      37 non-null     object
 18  cat_scales                               37 non-null     object
 19  car_wash                                 37 non-null     object
 20  ev_charging                              37 non-null     object
 21  rv_dumps                                 37 non-null     object
 22  propane                                  37 non-null     object
 23  x1_mile_pop                              37 non-null     int64 
 24  x1_mile_emp                              37 non-null     int64 
 25  x1_mile_income                           37 non-null     int64 
 26  x1_2_mile_pop                            37 non-null     int64 
 27  x1_2_mile_emp                            37 non-null     int64 
 28  x1_2_mile_income                         37 non-null     int64 
 29  x5_min_pop                               37 non-null     int64 
 30  x5_min_emp                               37 non-null     int64 
 31  x5_min_inc                               37 non-null     int64 
 32  x7_min_pop                               37 non-null     int64 
 33  x7_min_emp                               37 non-null     int64 
 34  x7_min_inc                               37 non-null     int64 
 35  traditional_forecourt_fueling_positions  37 non-null     int64 
 36  traditional_forecourt_layout             37 non-null     object
 37  traditional_forecourt_stack_type         37 non-null     object
 38  rv_lanes_fueling_positions               37 non-null     int64 
 39  rv_lanes_layout                          37 non-null     object
 40  rv_lanes_stack_type                      37 non-null     object
 41  hi_flow_lanes_fueling_positions          37 non-null     int64 
 42  hi_flow_lanes_layout                     37 non-null     object
 43  hi_flow_lanes_stack_type                 37 non-null     object
 44  hi_flow_lanes_fueling_positions_2        37 non-null     int64 
 45  rv_lanes_fueling_positions_2             37 non-null     int64 
 46  hi_flow_rv_lanes_layout                  37 non-null     object
 47  hi_flow_rv_lanes_stack_type              37 non-null     object
 48  non_24_hour                              37 non-null     object
 49  self_check_out                           37 non-null     object
 50  mens_toilet_count                        37 non-null     int64 
 51  mens_urinal_count                        37 non-null     int64 
 52  womens_toilet_count                      37 non-null     int64 
 53  womens_sink_count                        37 non-null     int64 
 54  site_id_msba                             37 non-null     int64 
dtypes: int64(28), object(27)
memory usage: 16.0+ KB

4.8 Feature Engineering- Label encoding for converting Categorical variables to Numerical variables¶

In [ ]:
from sklearn.preprocessing import LabelEncoder

# Use the 'categorical_vars' list to identify categorical columns

# Initialize a LabelEncoder
label_encoder = LabelEncoder()

# Encode each categorical column in 'categorical_vars'
for column in categorical_vars:
    if column in df_qdm.columns:
        df_qdm[column] = label_encoder.fit_transform(df_qdm[column])

# The categorical columns are now encoded with numerical values

# Display the first few rows of the DataFrame to verify
print(df_qdm.head())
   Unnamed: 0  open_year  square_feet  front_door_count  \
0           1       2021         5046                 2   
1           2       2021         5046                 2   
2           3       2021         5046                 2   
3           4       2021         5046                 2   
4           5       2021         5046                 2   

   years_since_last_project  parking_spaces  lottery  freal  bonfire_grill  \
0                         2              38        1      1              1   
1                         2              39        0      1              1   
2                         2              35        1      1              1   
3                         2              36        0      1              1   
4                         2              25        1      1              1   

   pizza  ...  rv_lanes_fueling_positions_2 hi_flow_rv_lanes_layout  \
0      0  ...                             6                       3   
1      1  ...                             4                       0   
2      1  ...                             5                       1   
3      1  ...                             4                       0   
4      0  ...                             0                       2   

   hi_flow_rv_lanes_stack_type non_24_hour  self_check_out  mens_toilet_count  \
0                        HF/RV          No             Yes                  2   
1                        HF/RV          No             Yes                  5   
2                         None          No             Yes                  3   
3                        HF/RV          No             Yes                  3   
4                         None          No             Yes                  0   

  mens_urinal_count womens_toilet_count  womens_sink_count site_id_msba  
0                 2                   6                  2        21560  
1                 5                  10                  4        21980  
2                 2                   4                  1        22015  
3                 3                   6                  2        22085  
4                 0                   0                  0        22120  

[5 rows x 55 columns]

5. Merging the time series dataset with the qualitative dataset¶

In [ ]:
merged_df = pd.merge(df_tsdm, df_qdm, on='site_id_msba', how='inner')

# Display the merged DataFrame
print(merged_df)
       Unnamed: 0_x capital_projects.soft_opening_date  \
0                 1                          6/14/2022   
1                 2                          6/14/2022   
2                 3                          6/14/2022   
3                 4                          6/14/2022   
4                 5                          6/14/2022   
...             ...                                ...   
13537         13904                          1/12/2021   
13538         13905                          1/12/2021   
13539         13906                          1/12/2021   
13540         13907                          1/12/2021   
13541         13908                          1/12/2021   

      calendar.calendar_day_date  calendar.fiscal_week_id_for_year  \
0                     2022-06-17                                25   
1                     2022-06-22                                25   
2                     2022-06-23                                25   
3                     2022-06-26                                26   
4                     2022-06-27                                26   
...                          ...                               ...   
13537                 2021-12-28                                52   
13538                 2022-01-01                                 1   
13539                 2022-01-04                                 1   
13540                 2022-01-07                                 2   
13541                 2022-01-11                                 2   

       calendar.day_of_week calendar_information.holiday  \
0                         4                         NONE   
1                         2                         NONE   
2                         3                         NONE   
3                         6                         NONE   
4                         0                         NONE   
...                     ...                          ...   
13537                     1                         NONE   
13538                     5               New Year's Day   
13539                     1                         NONE   
13540                     4                         NONE   
13541                     1                         NONE   

      calendar_information.type_of_day  daily_yoy_ndt.total_inside_sales  \
0                              WEEKDAY                         2168.2920   
1                              WEEKDAY                         2051.5635   
2                              WEEKDAY                         2257.5000   
3                              WEEKEND                         1520.5925   
4                              WEEKDAY                         1897.6930   
...                                ...                               ...   
13537                          WEEKDAY                         2984.0300   
13538                          WEEKEND                         2483.4145   
13539                          WEEKDAY                         3169.6875   
13540                          WEEKDAY                         3559.5805   
13541                          WEEKDAY                         3285.5305   

       daily_yoy_ndt.total_food_service   diesel_x  ...  \
0                              861.6930   722.7745  ...   
1                              808.0275   730.4850  ...   
2                              966.4410   895.7970  ...   
3                              542.3250   584.2900  ...   
4                              771.4525   852.2605  ...   
...                                 ...        ...  ...   
13537                          864.2795  1609.7830  ...   
13538                          597.8315   731.2935  ...   
13539                          970.4100  1851.6470  ...   
13540                         1072.4175  2104.9070  ...   
13541                         1061.8510  2337.6045  ...   

       hi_flow_lanes_fueling_positions_2  rv_lanes_fueling_positions_2  \
0                                      5                             4   
1                                      5                             4   
2                                      5                             4   
3                                      5                             4   
4                                      5                             4   
...                                  ...                           ...   
13537                                  4                             6   
13538                                  4                             6   
13539                                  4                             6   
13540                                  4                             6   
13541                                  4                             6   

       hi_flow_rv_lanes_layout  hi_flow_rv_lanes_stack_type non_24_hour  \
0                            0                        HF/RV          No   
1                            0                        HF/RV          No   
2                            0                        HF/RV          No   
3                            0                        HF/RV          No   
4                            0                        HF/RV          No   
...                        ...                          ...         ...   
13537                        3                        HF/RV          No   
13538                        3                        HF/RV          No   
13539                        3                        HF/RV          No   
13540                        3                        HF/RV          No   
13541                        3                        HF/RV          No   

       self_check_out mens_toilet_count  mens_urinal_count  \
0                 Yes                 1                  1   
1                 Yes                 1                  1   
2                 Yes                 1                  1   
3                 Yes                 1                  1   
4                 Yes                 1                  1   
...               ...               ...                ...   
13537             Yes                 2                  2   
13538             Yes                 2                  2   
13539             Yes                 2                  2   
13540             Yes                 2                  2   
13541             Yes                 2                  2   

      womens_toilet_count  womens_sink_count  
0                       2                  2  
1                       2                  2  
2                       2                  2  
3                       2                  2  
4                       2                  2  
...                   ...                ...  
13537                   6                  2  
13538                   6                  2  
13539                   6                  2  
13540                   6                  2  
13541                   6                  2  

[13542 rows x 79 columns]
In [ ]:
merged_df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 13542 entries, 0 to 13541
Data columns (total 79 columns):
 #   Column                                   Non-Null Count  Dtype         
---  ------                                   --------------  -----         
 0   Unnamed: 0_x                             13542 non-null  int64         
 1   capital_projects.soft_opening_date       13542 non-null  object        
 2   calendar.calendar_day_date               13542 non-null  datetime64[ns]
 3   calendar.fiscal_week_id_for_year         13542 non-null  int64         
 4   calendar.day_of_week                     13542 non-null  int64         
 5   calendar_information.holiday             13542 non-null  object        
 6   calendar_information.type_of_day         13542 non-null  object        
 7   daily_yoy_ndt.total_inside_sales         13542 non-null  float64       
 8   daily_yoy_ndt.total_food_service         13542 non-null  float64       
 9   diesel_x                                 13542 non-null  float64       
 10  unleaded                                 13542 non-null  float64       
 11  site_id_msba                             13542 non-null  int64         
 12  total_daily_sales                        13542 non-null  float64       
 13  total_daily_fuel_sales                   13542 non-null  float64       
 14  Date                                     13542 non-null  datetime64[ns]
 15  Year                                     13542 non-null  int64         
 16  Month                                    13542 non-null  object        
 17  Fiscal_Year                              13542 non-null  int64         
 18  Fiscal_Week                              13542 non-null  object        
 19  year                                     13542 non-null  int64         
 20  month                                    13542 non-null  int64         
 21  day                                      13542 non-null  int64         
 22  quarter                                  13542 non-null  int64         
 23  day_of_week                              13542 non-null  int64         
 24  is_weekend                               13542 non-null  int64         
 25  Unnamed: 0_y                             13542 non-null  int64         
 26  open_year                                13542 non-null  int64         
 27  square_feet                              13542 non-null  int64         
 28  front_door_count                         13542 non-null  int64         
 29  years_since_last_project                 13542 non-null  int64         
 30  parking_spaces                           13542 non-null  int64         
 31  lottery                                  13542 non-null  int64         
 32  freal                                    13542 non-null  int64         
 33  bonfire_grill                            13542 non-null  int64         
 34  pizza                                    13542 non-null  int64         
 35  cinnabon                                 13542 non-null  int64         
 36  godfather_s_pizza                        13542 non-null  object        
 37  ethanol_free                             13542 non-null  int64         
 38  diesel_y                                 13542 non-null  object        
 39  hi_flow_lanes                            13542 non-null  int64         
 40  rv_lanes                                 13542 non-null  int64         
 41  hi_flow_rv_lanes                         13542 non-null  object        
 42  def                                      13542 non-null  object        
 43  cat_scales                               13542 non-null  int64         
 44  car_wash                                 13542 non-null  object        
 45  ev_charging                              13542 non-null  object        
 46  rv_dumps                                 13542 non-null  int64         
 47  propane                                  13542 non-null  int64         
 48  x1_mile_pop                              13542 non-null  int64         
 49  x1_mile_emp                              13542 non-null  int64         
 50  x1_mile_income                           13542 non-null  int64         
 51  x1_2_mile_pop                            13542 non-null  int64         
 52  x1_2_mile_emp                            13542 non-null  int64         
 53  x1_2_mile_income                         13542 non-null  int64         
 54  x5_min_pop                               13542 non-null  int64         
 55  x5_min_emp                               13542 non-null  int64         
 56  x5_min_inc                               13542 non-null  int64         
 57  x7_min_pop                               13542 non-null  int64         
 58  x7_min_emp                               13542 non-null  int64         
 59  x7_min_inc                               13542 non-null  int64         
 60  traditional_forecourt_fueling_positions  13542 non-null  int64         
 61  traditional_forecourt_layout             13542 non-null  int64         
 62  traditional_forecourt_stack_type         13542 non-null  int64         
 63  rv_lanes_fueling_positions               13542 non-null  int64         
 64  rv_lanes_layout                          13542 non-null  int64         
 65  rv_lanes_stack_type                      13542 non-null  int64         
 66  hi_flow_lanes_fueling_positions          13542 non-null  int64         
 67  hi_flow_lanes_layout                     13542 non-null  int64         
 68  hi_flow_lanes_stack_type                 13542 non-null  object        
 69  hi_flow_lanes_fueling_positions_2        13542 non-null  int64         
 70  rv_lanes_fueling_positions_2             13542 non-null  int64         
 71  hi_flow_rv_lanes_layout                  13542 non-null  int64         
 72  hi_flow_rv_lanes_stack_type              13542 non-null  object        
 73  non_24_hour                              13542 non-null  object        
 74  self_check_out                           13542 non-null  object        
 75  mens_toilet_count                        13542 non-null  int64         
 76  mens_urinal_count                        13542 non-null  int64         
 77  womens_toilet_count                      13542 non-null  int64         
 78  womens_sink_count                        13542 non-null  int64         
dtypes: datetime64[ns](2), float64(6), int64(56), object(15)
memory usage: 8.3+ MB

5.1 Correlation Matrices¶

In [ ]:
import pandas as pd

# Define the target variables
target_variables = [
    'daily_yoy_ndt.total_inside_sales',
    'daily_yoy_ndt.total_food_service',
    'diesel_x',
    'unleaded'
]

# Variables to exclude
excluded_variables = ['total_daily_sales', 'total_daily_fuel_sales','Unnamed: 0_y','Unnamed: 0_x']

# Calculate the correlation matrix for the merged DataFrame
correlation_matrix = merged_df.corr()

# Create a dictionary to store the top 10 correlated variables for each target variable
top_correlated_variables = {}

# Iterate through each target variable
for target_variable in target_variables:
    # Get the absolute correlations between the target variable and all other variables
    correlations = correlation_matrix[target_variable].abs()

    # Exclude specified variables
    correlations = correlations.drop(excluded_variables, errors='ignore')

    # Sort the correlations in descending order and select the top 10 (excluding the target variable itself)
    top_correlations = correlations.drop(target_variable).nlargest(10)

    # Store the top correlated variables in the dictionary
    top_correlated_variables[target_variable] = top_correlations.index.tolist()

# Print the top 10 correlated variables for each target variable
for target_variable, correlated_variables in top_correlated_variables.items():
    print(f"Top 10 correlated variables with '{target_variable}':")
    print(correlated_variables)
    print()
Top 10 correlated variables with 'daily_yoy_ndt.total_inside_sales':
['daily_yoy_ndt.total_food_service', 'diesel_x', 'pizza', 'rv_lanes', 'hi_flow_lanes_fueling_positions', 'hi_flow_lanes_fueling_positions_2', 'rv_lanes_fueling_positions', 'rv_lanes_fueling_positions_2', 'hi_flow_lanes', 'rv_lanes_stack_type']

Top 10 correlated variables with 'daily_yoy_ndt.total_food_service':
['daily_yoy_ndt.total_inside_sales', 'diesel_x', 'hi_flow_lanes_fueling_positions', 'hi_flow_lanes_fueling_positions_2', 'hi_flow_lanes', 'rv_lanes_stack_type', 'rv_lanes', 'rv_lanes_fueling_positions', 'rv_lanes_fueling_positions_2', 'rv_lanes_layout']

Top 10 correlated variables with 'diesel_x':
['hi_flow_lanes_fueling_positions', 'hi_flow_lanes_fueling_positions_2', 'mens_urinal_count', 'mens_toilet_count', 'womens_sink_count', 'daily_yoy_ndt.total_food_service', 'hi_flow_lanes', 'rv_lanes_stack_type', 'cat_scales', 'rv_lanes']

Top 10 correlated variables with 'unleaded':
['lottery', 'daily_yoy_ndt.total_inside_sales', 'x7_min_emp', 'open_year', 'x5_min_emp', 'x7_min_pop', 'years_since_last_project', 'site_id_msba', 'daily_yoy_ndt.total_food_service', 'x5_min_pop']

<ipython-input-70-862f048f3726>:15: FutureWarning:

The default value of numeric_only in DataFrame.corr is deprecated. In a future version, it will default to False. Select only valid columns or specify the value of numeric_only to silence this warning.

Top 10 correlation variables with Total inside sales

In [ ]:
# Define the top 10 correlated variables
top_correlated_vars = [
    'daily_yoy_ndt.total_food_service', 'diesel_x', 'pizza', 'rv_lanes',
    'rv_lanes_stack_type', 'rv_lanes_layout', 'hi_flow_rv_lanes_layout',
    'hi_flow_lanes', 'rv_lanes_fueling_positions', 'hi_flow_lanes_layout'
]

# Extract the correlation matrix for the top correlated variables
correlation_matrix = merged_df[top_correlated_vars].corr()

# Create a correlogram (correlation matrix heatmap)
plt.figure(figsize=(10, 6))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt=".2f", linewidths=0.5)
plt.title('Correlogram of Top Correlated Variables with daily_yoy_ndt.total_inside_sales')
plt.tight_layout()

# Show the correlogram
plt.show()
In [ ]:
# Define the top 10 correlated variables
top_correlated_vars = [
    'daily_yoy_ndt.total_food_service', 'diesel_x', 'pizza', 'rv_lanes',
    'rv_lanes_stack_type', 'rv_lanes_layout', 'hi_flow_rv_lanes_layout',
    'hi_flow_lanes', 'rv_lanes_fueling_positions', 'hi_flow_lanes_layout'
]

# Create a DataFrame for the correlation values
correlation_df = merged_df.corr()  # Replace with your actual correlation matrix

# Get the correlation values for 'daily_yoy_ndt.total_inside_sales' with the top 10 variables
correlation_values = correlation_df['daily_yoy_ndt.total_inside_sales'][top_correlated_vars]

# Create a bar chart for correlations
plt.figure(figsize=(10, 6))
bar_plot = sns.barplot(x=top_correlated_vars, y=correlation_values, palette='viridis')
plt.title('Correlation with daily_yoy_ndt.total_inside_sales')
plt.xlabel('Variables')
plt.ylabel('Correlation Coefficient')
plt.xticks(rotation=45, ha='right')  # Rotate x-axis labels for better readability

# Annotate the bars with correlation values
for index, value in enumerate(correlation_values):
    bar_plot.text(index, value, f'{value:.2f}', ha='center', va='bottom', fontsize=10)

# Show the bar chart with annotations
plt.tight_layout()
plt.show()
<ipython-input-401-672d2503f3eb>:9: FutureWarning:

The default value of numeric_only in DataFrame.corr is deprecated. In a future version, it will default to False. Select only valid columns or specify the value of numeric_only to silence this warning.

As per the correlated bar plot, the total inside sales were in strong correlation with 1. Food service sales 2. Diesel gallons sold and 3. Pizza availability.

Top 10 correlation variables with Total Food service sales

In [ ]:
# Define the top 10 correlated variables
top_correlated_vars = [
    'daily_yoy_ndt.total_inside_sales', 'diesel_x', 'hi_flow_lanes_fueling_positions', 'hi_flow_lanes_fueling_positions_2',
    'rv_lanes_stack_type', 'hi_flow_lanes', 'rv_lanes', 'rv_lanes_fueling_positions', 'rv_lanes_fueling_positions_2', 'rv_lanes_layout'
]

# Create a DataFrame for the correlation values
correlation_df = merged_df.corr()  # Replace with your actual correlation matrix

# Get the correlation values for 'daily_yoy_ndt.total_food_service' with the top 10 variables
correlation_values = correlation_df['daily_yoy_ndt.total_food_service'][top_correlated_vars]

# Create a bar chart for correlations
plt.figure(figsize=(10, 6))
bar_plot = sns.barplot(x=top_correlated_vars, y=correlation_values, palette='viridis')
plt.title('Correlation with daily_yoy_ndt.total_food_service')
plt.xlabel('Variables')
plt.ylabel('Correlation Coefficient')
plt.xticks(rotation=45, ha='right')  # Rotate x-axis labels for better readability

# Annotate the bars with correlation values
for index, value in enumerate(correlation_values):
    bar_plot.text(index, value, f'{value:.2f}', ha='center', va='bottom', fontsize=10)

# Show the bar chart with annotations
plt.tight_layout()
plt.show()
<ipython-input-72-de4ea397bddc>:8: FutureWarning:

The default value of numeric_only in DataFrame.corr is deprecated. In a future version, it will default to False. Select only valid columns or specify the value of numeric_only to silence this warning.

As per the correlated bar plot, the total food service sales were in strong correlation with 1. Total Inside sales 2. Number of Diesel gallons sold 3. Availability of High flow fuel positions.

Top 10 correlation variables with Diesel

In [ ]:
# Define the top 10 correlated variables
top_correlated_vars = [
    'hi_flow_lanes_fueling_positions', 'hi_flow_lanes_fueling_positions_2', 'mens_urinal_count', 'mens_toilet_count',
    'womens_sink_count', 'daily_yoy_ndt.total_food_service', 'hi_flow_lanes', 'rv_lanes_stack_type', 'cat_scales', 'rv_lanes'
]

# Create a DataFrame for the correlation values
correlation_df = merged_df.corr()  # Replace with your actual correlation matrix

# Get the correlation values for 'diesel' with the top 10 variables
correlation_values = correlation_df['diesel_x'][top_correlated_vars]

# Create a bar chart for correlations
plt.figure(figsize=(10, 6))
bar_plot = sns.barplot(x=top_correlated_vars, y=correlation_values, palette='viridis')
plt.title('Correlation with diesel')
plt.xlabel('Variables')
plt.ylabel('Correlation Coefficient')
plt.xticks(rotation=45, ha='right')  # Rotate x-axis labels for better readability

# Annotate the bars with correlation values
for index, value in enumerate(correlation_values):
    bar_plot.text(index, value, f'{value:.2f}', ha='center', va='bottom', fontsize=10)

# Show the bar chart with annotations
plt.tight_layout()
plt.show()
<ipython-input-73-3320275aaed6>:8: FutureWarning:

The default value of numeric_only in DataFrame.corr is deprecated. In a future version, it will default to False. Select only valid columns or specify the value of numeric_only to silence this warning.

As per the correlated bar plot, diesel gallons sold were in strong correlation with 1.Availability of High flow fuel positions.2.mens_urnial_count 3.mens_toilet_count

Top 10 correlation variables with Unleaded

In [ ]:
# Define the top 10 correlated variables
top_correlated_vars = [
    'daily_yoy_ndt.total_inside_sales', 'daily_yoy_ndt.total_food_service', 'pizza', 'rv_lanes',
    'rv_lanes_stack_type', 'rv_lanes_layout', 'hi_flow_rv_lanes_layout',
    'hi_flow_lanes', 'rv_lanes_fueling_positions', 'hi_flow_lanes_layout'
]

# Create a DataFrame for the correlation values
correlation_df = merged_df.corr()

# Get the correlation values for 'unleaded' with the top 10 variables
correlation_values = correlation_df['unleaded'][top_correlated_vars]

# Create a bar chart for correlations
plt.figure(figsize=(10, 6))
bar_plot = sns.barplot(x=top_correlated_vars, y=correlation_values, palette='viridis')
plt.title('Correlation with unleaded')
plt.xlabel('Variables')
plt.ylabel('Correlation Coefficient')
plt.xticks(rotation=45, ha='right')  # Rotate x-axis labels for better readability

# Annotate the bars with correlation values
for index, value in enumerate(correlation_values):
    bar_plot.text(index, value, f'{value:.2f}', ha='center', va='bottom', fontsize=10)

# Show the bar chart with annotations
plt.tight_layout()
plt.show()
<ipython-input-404-e99c214cc2af>:9: FutureWarning:

The default value of numeric_only in DataFrame.corr is deprecated. In a future version, it will default to False. Select only valid columns or specify the value of numeric_only to silence this warning.

As per the correlated bar plot, all non-diesel gallons/unleaded sold were in strong correlation with 1. Lottery 2. Total Inside sales 3. years since last project.

5.2 Holiday vs Non-Holiday total daily fuel sales¶

In [ ]:
# Group the 'merged_df' DataFrame by the 'calendar_information.holiday' column
holiday_fuel_sales = merged_df.groupby('calendar_information.holiday')['total_daily_fuel_sales'].describe()

# Reset the index of the DataFrame to convert the grouped-by values into a column
holiday_fuel_sales.reset_index(inplace=True)

holiday_fuel_sales
Out[ ]:
calendar_information.holiday count mean std min 25% 50% 75% max
0 All Saint's Day 37.0 4410.757608 2779.691996 1145.5640 2892.421000 4071.84400 4851.675500 17498.2010
1 Ascension 34.0 4666.302015 3081.437077 1158.0450 3112.347875 4614.75525 5312.503875 19354.5240
2 Christmas Day 37.0 1551.972176 693.931634 337.1970 1110.791500 1424.07650 2169.359500 3076.4405
3 Christmas Eve 37.0 2473.872554 992.136641 843.9025 1757.644000 2262.54700 3024.581000 4894.6345
4 Columbus Day 37.0 4558.993392 2557.522924 1199.0510 3220.532000 4363.47450 5272.127000 15846.8380
5 Easter 36.0 3161.145458 1967.906769 702.0160 1747.592875 2662.85250 4336.787000 10843.1925
6 Father's Day 37.0 3341.215784 2249.101734 750.9355 1546.482000 2932.14950 4541.460000 11605.1810
7 Flag Day 38.0 4658.869618 3115.888701 1047.7915 3013.122000 4496.65825 5333.341125 19579.6755
8 Good Friday 36.0 4805.674708 2607.354268 1143.9645 3446.307375 4461.14375 6193.477500 15595.6255
9 Halloween Day 37.0 3451.179162 2530.643795 1191.6135 1931.660500 3001.17650 3894.352000 16029.1635
10 Independence Day 37.0 2913.870419 1846.872003 804.4610 1289.722000 2612.66600 3848.194000 9952.3935
11 Labor Day 37.0 3781.042338 2144.169470 1016.5645 2230.504500 3533.07150 4924.244500 10749.2175
12 Lincoln's Birthday 37.0 3080.713622 1489.664455 787.3495 2201.307500 2747.95150 3903.056500 6134.1140
13 Martin Luther King Day 37.0 3742.821108 1779.183035 1090.2045 2342.119500 3924.26650 4566.061500 8409.3590
14 Memorial Day 37.0 3649.906230 2263.730822 826.9555 2032.478000 3435.27800 4378.542000 13119.9845
15 Mother's Day 36.0 3201.183514 1688.735060 783.8460 1848.823375 3025.90050 4484.713625 7334.7645
16 NONE 12624.0 4171.412225 2549.105475 382.3365 2446.410750 3958.17100 5180.982625 23651.7295
17 New Year's Day 37.0 2048.330351 899.828017 573.9475 1365.843500 2079.25200 2657.298000 3624.8765
18 New Year's Eve 37.0 2830.532297 1140.178311 744.5970 2114.514500 2854.14150 3520.979000 5456.5700
19 Palm Sunday 35.0 3254.194300 2068.372616 696.1605 1900.253250 2742.57900 4459.848750 11501.0560
20 President's Day 37.0 4148.618203 2017.706565 1189.3490 2389.138500 4086.27800 5160.638000 10097.8010
21 Saint Patrick's Day 37.0 4274.989392 2320.501370 1018.5385 2855.055000 4198.61050 5137.457500 14241.8220
22 Saint Valentine's Day 37.0 3885.591500 2029.317577 1218.5215 2202.308500 3655.66950 4742.990000 9565.6995
23 Thanksgiving Day 37.0 2188.708446 1409.208831 704.3400 1378.114500 1927.22950 2499.952000 8747.1370
24 Veteran's Day 37.0 4590.720703 2643.833191 1062.6035 2881.259500 4456.57450 5209.540000 15532.7130
25 Washington's Birthday 37.0 3752.085608 2168.592030 1034.4495 2425.017000 3416.99750 4237.411500 11755.5165
In [ ]:
#Find the holiday with the highest average total daily fuel sales
holiday_fuel_sales[holiday_fuel_sales['mean']==holiday_fuel_sales['mean'].max()]
Out[ ]:
calendar_information.holiday count mean std min 25% 50% 75% max
8 Good Friday 36.0 4805.674708 2607.354268 1143.9645 3446.307375 4461.14375 6193.4775 15595.6255
In [ ]:
#Find the holiday with the lowest average total daily fuel sales
holiday_fuel_sales[holiday_fuel_sales['mean']==holiday_fuel_sales['mean'].min()]
Out[ ]:
calendar_information.holiday count mean std min 25% 50% 75% max
2 Christmas Day 37.0 1551.972176 693.931634 337.197 1110.7915 1424.0765 2169.3595 3076.4405

The holiday with the highest average total daily fuel sales is 'Good Friday', with a mean sales value of approximately 4805.67.

'Christmas Day' has the lowest average total daily sales, with a mean value of around 1551.97.

In [ ]:
# Create a new figure with specified size (width=20, height=10)
plt.figure(figsize=(20, 10))

# Create a boxplot wherein
# 'x' specifies the column for the x-axis ('calendar_information.holiday')
# 'y' specifies the column for the y-axis ('total_daily_fuel_sales')
sns.boxplot(data=merged_df, x='calendar_information.holiday', y='total_daily_fuel_sales')

# Set the title of the plot
plt.title('Distribution of Total Daily Fuel Sales by Holiday')

# Set the label for the x-axis
plt.xlabel('Holiday')

# Set the label for the y-axis
plt.ylabel('Total Fuel Sales (Gallons)')

# Rotate x-axis labels by 90 degrees for better visibility
plt.xticks(rotation=90)

# Display the plot
plt.show()

There is a similar variation in total daily gallons of fuel sold across different holidays. Holidays like 'Good Friday', 'Ascension', and 'Flag Day' tend to have higher median sales compared to others.

'Thanksgiving Day' and 'Christmas Day' appear to have lower median sales, which could be due to people spending more time at home.

In [ ]:
df_qdm.describe()
Out[ ]:
Unnamed: 0 open_year square_feet front_door_count years_since_last_project parking_spaces lottery freal bonfire_grill pizza ... hi_flow_lanes_fueling_positions hi_flow_lanes_layout hi_flow_lanes_fueling_positions_2 rv_lanes_fueling_positions_2 hi_flow_rv_lanes_layout mens_toilet_count mens_urinal_count womens_toilet_count womens_sink_count site_id_msba
count 37.000000 37.000000 37.00000 37.0 37.000000 37.000000 37.000000 37.000000 37.000000 37.000000 ... 37.000000 37.000000 37.000000 37.000000 37.000000 37.000000 37.000000 37.000000 37.000000 37.000000
mean 19.000000 2021.324324 4970.27027 2.0 1.648649 37.405405 0.621622 0.972973 0.702703 0.405405 ... 3.324324 0.459459 3.324324 2.513514 0.864865 2.378378 2.351351 4.648649 1.702703 23040.405405
std 10.824355 0.474579 575.93121 0.0 0.483978 5.918237 0.491672 0.164399 0.463373 0.497743 ... 2.925501 0.557504 2.925501 2.049683 1.031777 0.923500 0.856875 1.751447 0.740303 730.069801
min 1.000000 2021.000000 2933.00000 2.0 1.000000 23.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 21560.000000
25% 10.000000 2021.000000 5046.00000 2.0 1.000000 34.000000 0.000000 1.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 2.000000 2.000000 4.000000 1.000000 22540.000000
50% 19.000000 2021.000000 5046.00000 2.0 2.000000 38.000000 1.000000 1.000000 1.000000 0.000000 ... 5.000000 0.000000 5.000000 4.000000 0.000000 2.000000 2.000000 4.000000 2.000000 22890.000000
75% 28.000000 2022.000000 5046.00000 2.0 2.000000 41.000000 1.000000 1.000000 1.000000 1.000000 ... 5.000000 1.000000 5.000000 4.000000 2.000000 3.000000 3.000000 6.000000 2.000000 23555.000000
max 37.000000 2022.000000 6134.00000 2.0 2.000000 49.000000 1.000000 1.000000 1.000000 1.000000 ... 9.000000 2.000000 9.000000 6.000000 3.000000 5.000000 5.000000 10.000000 4.000000 24535.000000

8 rows × 45 columns

6. First Year Sales¶

6.1 First Year Sales Analysis (Inside + Food Service)¶

In [ ]:
# Convert the 'capital_projects.soft_opening_date' column to datetime
merged_df['capital_projects.soft_opening_date'] = pd.to_datetime(merged_df['capital_projects.soft_opening_date'])

#get first year non-fuel sales ('total_daily_sales')
def get_first_year_sales(group):
    opening_date = group['capital_projects.soft_opening_date'].min()
    one_year_later = opening_date + pd.Timedelta(days=365)
    first_year_data = group[group['calendar.calendar_day_date'] <= one_year_later]
    return first_year_data['total_daily_sales'].sum()

first_year_sales = merged_df.groupby('site_id_msba').apply(get_first_year_sales)

#sort values from highest sales to lowest
first_year_sales_sorted = first_year_sales.sort_values(ascending=False)

first_year_sales_sorted.head(10)
Out[ ]:
site_id_msba
22085    2.323099e+06
22750    2.056672e+06
21980    2.000774e+06
23415    1.804503e+06
23660    1.664968e+06
23345    1.663042e+06
21560    1.569289e+06
23380    1.542020e+06
23730    1.533919e+06
24150    1.498044e+06
dtype: float64
In [ ]:
#get sites with top 10 highest sales
top_10_stores = first_year_sales_sorted.head(10).index.tolist()

#get qualitative data for sites with top 10 highest sales
qualitative_data_top_10 = df_qdm[df_qdm['site_id_msba'].isin(top_10_stores)]

qualitative_data_top_10
Out[ ]:
Unnamed: 0 open_year square_feet front_door_count years_since_last_project parking_spaces lottery freal bonfire_grill pizza ... rv_lanes_fueling_positions_2 hi_flow_rv_lanes_layout hi_flow_rv_lanes_stack_type non_24_hour self_check_out mens_toilet_count mens_urinal_count womens_toilet_count womens_sink_count site_id_msba
0 1 2021 5046 2 2 38 1 1 1 0 ... 6 3 HF/RV No Yes 2 2 6 2 21560
1 2 2021 5046 2 2 39 0 1 1 1 ... 4 0 HF/RV No Yes 5 5 10 4 21980
3 4 2021 5046 2 2 36 0 1 1 1 ... 4 0 HF/RV No Yes 3 3 6 2 22085
14 15 2021 5046 2 2 34 0 1 1 1 ... 4 0 HF/RV No Yes 2 2 4 1 22750
22 23 2021 5046 2 2 29 1 1 1 1 ... 2 0 HF/RV No Yes 2 2 4 1 23345
23 24 2021 5046 2 2 34 0 1 1 1 ... 4 0 HF/RV No Yes 3 3 6 2 23380
24 25 2022 2940 2 1 23 0 1 1 1 ... 4 0 HF/RV No Yes 2 2 6 3 23415
28 29 2022 5046 2 1 28 0 1 1 1 ... 4 0 HF/RV No Yes 3 3 6 2 23660
29 30 2021 5046 2 2 37 0 1 1 1 ... 4 0 HF/RV No Yes 3 3 6 2 23730
33 34 2022 2933 2 1 30 1 1 1 0 ... 0 2 None No Yes 2 2 2 2 24150

10 rows × 55 columns

In [ ]:
#count unique values for 'open year'
qualitative_data_top_10['open_year'].value_counts()
Out[ ]:
2021    7
2022    3
Name: open_year, dtype: int64
In [ ]:
#count unique values for 'parking_spaces'
qualitative_data_top_10['parking_spaces'].value_counts()
Out[ ]:
34    2
38    1
39    1
36    1
29    1
23    1
28    1
37    1
30    1
Name: parking_spaces, dtype: int64
In [ ]:
#count unique values for 'square_feet'
qualitative_data_top_10['square_feet'].value_counts()
Out[ ]:
5046    8
2940    1
2933    1
Name: square_feet, dtype: int64

Top 10 stores with highest sales (USD)

Open Year: All these stores opened either in 2021 or 2022.

Square Feet: Most stores have a square footage of around 5046, except for two.

Parking Spaces: The number of parking spaces varies but generally ranges from 23 to 39.

Amenities: Most stores offer amenities like lottery, freal, and a bonfire grill.

6.2 First Year Fuel Sales Analysis (Diesel + Unleaded)¶

In [ ]:
#get first year fuel sales ('total_daily_fuel_sales')
def get_first_year_fuel_sales(group):
    opening_date = group['capital_projects.soft_opening_date'].min()
    one_year_later = opening_date + pd.Timedelta(days=365)
    first_year_data = group[group['calendar.calendar_day_date'] <= one_year_later]
    return first_year_data['total_daily_fuel_sales'].sum()

first_year_fuel_sales = merged_df.groupby('site_id_msba').apply(get_first_year_fuel_sales)

#sort values from highest fuel sales to lowest
first_year_fuel_sales_sorted = first_year_fuel_sales.sort_values(ascending=False)

first_year_fuel_sales_sorted.head(10)
Out[ ]:
site_id_msba
21980    5.018284e+06
22715    2.548204e+06
23660    2.460152e+06
22820    2.162210e+06
22260    2.053452e+06
22890    1.973883e+06
22575    1.874454e+06
23380    1.836589e+06
22785    1.768989e+06
23240    1.753365e+06
dtype: float64
In [ ]:
#get sites with top 10 highest fuel sales
top_10_stores_fuel = first_year_fuel_sales_sorted.head(10).index.tolist()

#get qualitative data for sites with top 10 highest fuel sales
fuel_qdata_top_10 = df_qdm[df_qdm['site_id_msba'].isin(top_10_stores_fuel)]

fuel_qdata_top_10
Out[ ]:
Unnamed: 0 open_year square_feet front_door_count years_since_last_project parking_spaces lottery freal bonfire_grill pizza ... rv_lanes_fueling_positions_2 hi_flow_rv_lanes_layout hi_flow_rv_lanes_stack_type non_24_hour self_check_out mens_toilet_count mens_urinal_count womens_toilet_count womens_sink_count site_id_msba
1 2 2021 5046 2 2 39 0 1 1 1 ... 4 0 HF/RV No Yes 5 5 10 4 21980
5 6 2021 5046 2 2 38 1 1 0 0 ... 0 2 None No Yes 4 2 4 2 22260
10 11 2021 5046 2 2 44 1 1 0 1 ... 4 0 HF/RV No Yes 3 3 6 2 22575
13 14 2021 5046 2 2 46 1 1 1 1 ... 4 0 HF/RV No Yes 3 3 5 2 22715
15 16 2021 5046 2 2 41 1 1 1 0 ... 4 0 HF/RV No Yes 3 3 6 2 22785
16 17 2021 5046 2 2 36 1 0 1 1 ... 4 0 HF/RV No Yes 2 2 4 1 22820
18 19 2022 6134 2 1 48 1 1 1 0 ... 4 0 HF/RV No Yes 3 3 2 2 22890
21 22 2021 5046 2 2 41 1 1 1 1 ... 4 0 HF/RV No Yes 3 3 6 2 23240
23 24 2021 5046 2 2 34 0 1 1 1 ... 4 0 HF/RV No Yes 3 3 6 2 23380
28 29 2022 5046 2 1 28 0 1 1 1 ... 4 0 HF/RV No Yes 3 3 6 2 23660

10 rows × 55 columns

In [ ]:
#count unique values for 'open_year'
fuel_qdata_top_10['open_year'].value_counts()
Out[ ]:
2021    8
2022    2
Name: open_year, dtype: int64
In [ ]:
#count unique values for 'parking_spaces'
fuel_qdata_top_10['parking_spaces'].value_counts()
Out[ ]:
41    2
39    1
38    1
44    1
46    1
36    1
48    1
34    1
28    1
Name: parking_spaces, dtype: int64
In [ ]:
#count unique values for 'square_feet'
fuel_qdata_top_10['square_feet'].value_counts()
Out[ ]:
5046    9
6134    1
Name: square_feet, dtype: int64

Top 10 stores with highest fuel sales (Gallons)

Open Year: All these stores opened either in 2021 or 2022.

Square Feet: Most stores have a square footage of around 5046, except for one with 6134.

Parking Spaces: The number of parking spaces varies but generally ranges from 28 to 48.

Amenities: Most stores offer amenities like lottery, freal, and a bonfire grill.

6.3 First Year Sales Visualizations¶

In [ ]:
# Merge 'df_qdm' DataFrame with 'first_year_sales_sorted' DataFrame based on 'site_id_msba' using a left-join
first_year_sales_qualitative_data = pd.merge(
    df_qdm,
    pd.DataFrame(first_year_sales_sorted).reset_index(),
    left_on='site_id_msba',
    right_on='site_id_msba',
    how='left'
)

# Rename the column representing first-year sales to 'First_Year_Sales'
first_year_sales_qualitative_data.rename(columns={0: 'First_Year_Sales'}, inplace=True)

# Perform a similar merge for 'first_year_fuel_sales_sorted'
first_year_fuel_sales_qualitative_data = pd.merge(
    df_qdm,
    pd.DataFrame(first_year_fuel_sales_sorted).reset_index(),
    left_on='site_id_msba',
    right_on='site_id_msba',
    how='left'
)

# Rename the column representing first-year fuel sales to 'First_Year_Sales'
first_year_fuel_sales_qualitative_data.rename(columns={0: 'First_Year_Sales'}, inplace=True)

# Create a new figure for plotting with a specified size (width=12, height=6)
plt.figure(figsize=(12, 6))

# Create the first subplot in a 1-row, 2-column grid
plt.subplot(1, 2, 1)

# Create a scatter plot for 'Number of Parking Spaces vs. First-Year Sales'
sns.scatterplot(data=first_year_sales_qualitative_data, x='parking_spaces', y='First_Year_Sales', palette='viridis')
plt.title('Number of Parking Spaces vs. First-Year Sales')
plt.xlabel('Number of Parking Spaces')
plt.ylabel('First-Year Sales (USD)')

# Create the second subplot in the same 1-row, 2-column grid
plt.subplot(1, 2, 2)

# Create a scatter plot for 'Number of Parking Spaces vs. First-Year Fuel Sales'
sns.scatterplot(data=first_year_fuel_sales_qualitative_data, x='parking_spaces', y='First_Year_Sales', palette='viridis')
plt.title('Number of Parking Spaces vs. First-Year Fuel Sales')
plt.xlabel('Number of Parking Spaces')
plt.ylabel('First-Year Fuel Sales (Gallons)')

# Display the plots
plt.show()
<ipython-input-421-ad4a74534801>:32: UserWarning:

Ignoring `palette` because no `hue` variable has been assigned.

<ipython-input-421-ad4a74534801>:41: UserWarning:

Ignoring `palette` because no `hue` variable has been assigned.

The scatter plots shows the Number of Parking Spaces vs. First-Year Sales (Fuel and Non-Fuel sales) for the top 10 performing stores. Each point represents a store.

The number of parking spaces doesn't seem to have a strong correlation with first-year sales. The number of parking spaces varies among these stores, but their first-year sales still differ significantly.

In [ ]:
plt.figure(figsize=(15, 10))

# Create the first subplot in a 2-row, 3-column grid
# The boxplot shows the distribution of First-Year Sales based on the presence or absence of a lottery
plt.subplot(2, 3, 1)
sns.boxplot(x='lottery', y='First_Year_Sales', data=first_year_sales_qualitative_data)
plt.title('First-Year Sales (USD) by Lottery')

# Create the second subplot in the same grid
# This boxplot focuses on the feature 'freal'
plt.subplot(2, 3, 2)
sns.boxplot(x='freal', y='First_Year_Sales', data=first_year_sales_qualitative_data)
plt.title("First-Year Sales (USD) by Freal")

# Create the third subplot in the same grid
# This boxplot focuses on the feature 'bonfire_grill'
plt.subplot(2, 3, 3)
sns.boxplot(x='bonfire_grill', y='First_Year_Sales', data=first_year_sales_qualitative_data)
plt.title('First-Year Sales (USD) by Bonfire Grill')

# Create the fourth subplot, which focuses on First-Year Fuel Sales and the feature 'lottery'
plt.subplot(2, 3, 4)
sns.boxplot(x='lottery', y='First_Year_Sales', data=first_year_fuel_sales_qualitative_data)
plt.title('First-Year Fuel Sales (Gallons) by Lottery')

# Create the fifth subplot, which focuses on First-Year Fuel Sales and the feature 'freal'
plt.subplot(2, 3, 5)
sns.boxplot(x='freal', y='First_Year_Sales', data=first_year_fuel_sales_qualitative_data)
plt.title("First-Year Fuel Sales (Gallons) by Freal")

# Create the sixth and final subplot, which focuses on First-Year Fuel Sales and the feature 'bonfire_grill'
plt.subplot(2, 3, 6)
sns.boxplot(x='bonfire_grill', y='First_Year_Sales', data=first_year_fuel_sales_qualitative_data)
plt.title('First-Year Fuel Sales (Gallons) by Bonfire Grill')

# Adjust the layout so that plots do not overlap
plt.tight_layout()

# Display the figure containing all the subplots
plt.show()

Inside sales and Food service sales¶

First-Year Sales by Lottery: Stores without a lottery seem to have slightly higher median sales than those with a lottery. However, the difference is not very significant, and the range of sales is wide in both categories.

First-Year Sales by Freal: Stores with Freal appear to have slightly higher median sales compared to those without. Given the box plot for 'No' we can confirm that there is only 1 store with 'No' freal as seen in the dataset.

First-Year Sales by Bonfire Grill: Stores with a Bonfire Grill feature appear to have a lower median in first-year sales compared to those without this feature.

Fuel sale¶

First-Year Fuel Sales by Lottery: Stores with lottery seem to have very slight higher median fuel sales than those with a lottery. However, the difference is not very significant.

First-Year Fuel Sales by Freal: Stores with Freal appear to have slightly lower median sales compared to those without. Given the box plot for 'No' we can confirm that there is only 1 store with 'No' freal as seen in the dataset.

First-Year Fuel Sales by Bonfire Grill: Stores with a Bonfire Grill feature appear to have a slightly higher median in first-year sales compared to those without this feature.

6.4 Correlation Matrices for First Year Sales¶

In [ ]:
# Calculate the correlation matrix for numerical columns in the 'first_year_sales_qualitative_data' DataFrame
correlation_matrix = first_year_sales_qualitative_data.select_dtypes(include=['number']).corr()

plt.figure(figsize=(20, 15))

# Generate a heatmap to visualize the correlation matrix with spearman coefficients
sns.heatmap(correlation_matrix[['First_Year_Sales']].sort_values(by='First_Year_Sales', ascending=False),
            annot=True,
            cmap='coolwarm',
            vmin=-1,
            vmax=1,
            linewidths=.5)

# Set the title for the heatmap
plt.title('Correlation Matrix: Features vs. First-Year Sales (USD)')

# Display the heatmap
plt.show()

In this correlation heatmap we can see pizza and laning positions have a strong positive correlation and population features have a low correlation with First-Year sales (USD)

In [ ]:
# Calculate the correlation matrix for numerical columns in the 'first_year_fuel_sales_qualitative_data' DataFrame
correlation_matrix = first_year_fuel_sales_qualitative_data.select_dtypes(include=['number']).corr()

plt.figure(figsize=(20, 15))

# Generate a heatmap to visualize the correlation matrix with spearman coefficients
sns.heatmap(correlation_matrix[['First_Year_Sales']].sort_values(by='First_Year_Sales', ascending=False),
            annot=True,
            cmap='coolwarm',
            vmin=-1,
            vmax=1,
            linewidths=.5)

# Set the title for the heatmap
plt.title('Correlation Matrix: Features vs. First-Year Fuel Sales (Gallons)')

# Display the heatmap
plt.show()

In this correlation heatmap we can see toilets and laning positions have a strong positive correlation and population features and lane layouts have a low correlation with First-Year fuel sales (Gallons)

6.5 Distribution of First Year Sales¶

In [ ]:
plt.figure(figsize=(14, 6))

# Create the first subplot
# Generate a histogram of 'first_year_sales_sorted' with 20 bins and a Kernel Density Estimation (kde)
plt.subplot(2, 2, 1)
sns.histplot(first_year_sales_sorted, bins=20, color='skyblue', kde=True)
plt.title('Histogram of First-Year Sales')
plt.xlabel('First-Year Sales (USD)')
plt.ylabel('Frequency')

# Create the second subplot
# Generate a boxplot of 'first_year_sales_sorted'
plt.subplot(2, 2, 2)
sns.boxplot(x=first_year_sales_sorted, color='salmon')
plt.title('Boxplot of First-Year Sales')
plt.xlabel('First-Year Sales (USD)')

# Create the third subplot
# Generate a histogram of 'first_year_fuel_sales_sorted' with 20 bins and a Kernel Density Estimation (kde)
plt.subplot(2, 2, 3)
sns.histplot(first_year_fuel_sales_sorted, bins=20, color='skyblue', kde=True)
plt.title('Histogram of First-Year Fuel Sales')
plt.xlabel('First-Year Fuel Sales (Gallons)')
plt.ylabel('Frequency')

# Create the fourth subplot
# Generate a boxplot of 'first_year_fuel_sales_sorted'
plt.subplot(2, 2, 4)
sns.boxplot(x=first_year_fuel_sales_sorted, color='salmon')
plt.title('Boxplot of First-Year Fuel Sales')
plt.xlabel('First-Year Fuel Sales (Gallons)')

plt.tight_layout()

# Display the figure containing all the subplots
plt.show()

We can see that the distribution for the First-Year Sales is right skewed meaning that the first-year sales are usually on the lower side with some outliers on the higher-end.

7. Top Performing store (First-Year non-fuel sales)¶

In [ ]:
first_year_sales_sorted.index[0]
Out[ ]:
22085
In [ ]:
from statsmodels.tsa.seasonal import seasonal_decompose

# Find the store ID with the highest first-year sales ('top_store_id')
top_store_id = first_year_sales_sorted.index[0]

# Filter the data for the top store from the 'df_tsdm' DataFrame
top_store_data = df_tsdm[df_tsdm['site_id_msba'] == top_store_id]

# Add a new column 'total_daily_sales' to 'top_store_data' by filtering the top store's daily sales from 'merged_df'
top_store_data['total_daily_sales'] = merged_df[merged_df['site_id_msba'] == top_store_id]['total_daily_sales'].tolist()

top_store_data = top_store_data.sort_values('calendar.calendar_day_date').reset_index(drop=True)

# Set the date column as the index and convert it to a DatetimeIndex with daily frequency
top_store_data.set_index('calendar.calendar_day_date', inplace=True)
top_store_data.index = pd.DatetimeIndex(top_store_data.index).to_period('D')

# Fill any missing dates with a frequency of 'D' (daily) and fill missing values with 0
top_store_data = top_store_data.asfreq('D').fillna(0)

top_store_data.head()
<ipython-input-427-6ad61212beb1>:10: SettingWithCopyWarning:


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy

Out[ ]:
Unnamed: 0 capital_projects.soft_opening_date calendar.fiscal_week_id_for_year calendar.day_of_week calendar_information.holiday calendar_information.type_of_day daily_yoy_ndt.total_inside_sales daily_yoy_ndt.total_food_service diesel unleaded ... Year Month Fiscal_Year Fiscal_Week year month day quarter day_of_week is_weekend
calendar.calendar_day_date
2021-02-02 12616 2021-02-02 5 1 NONE WEEKDAY 3085.4005 1195.7400 765.4710 1044.7430 ... 2021 2021-02-02 2021 05 2021 2 2 1 1 0
2021-02-03 12716 2021-02-02 5 2 NONE WEEKDAY 3567.2945 1357.3525 1419.1380 1460.5675 ... 2021 2021-02-03 2021 05 2021 2 3 1 2 0
2021-02-04 12534 2021-02-02 5 3 NONE WEEKDAY 4141.1230 1606.9690 1111.4215 1723.8865 ... 2021 2021-02-04 2021 05 2021 2 4 1 3 0
2021-02-05 12617 2021-02-02 6 4 NONE WEEKDAY 3795.4245 1455.3980 1292.8895 1662.8885 ... 2021 2021-02-05 2021 05 2021 2 5 1 4 0
2021-02-06 12717 2021-02-02 6 5 NONE WEEKEND 1883.9975 592.1895 808.9200 1043.1820 ... 2021 2021-02-06 2021 05 2021 2 6 1 5 1

5 rows × 24 columns

7.1 Time-Series Decomposition for the Top-performing store¶

In [ ]:
# Perform seasonal decomposition on 'total_daily_sales' of the top-performing store
decomposition = seasonal_decompose(top_store_data['total_daily_sales'], model='additive', period=7)

plt.figure(figsize=(14, 8))

# Plot the observed (original) values
plt.subplot(411)
plt.plot(decomposition.observed.values, label='Original')
plt.legend(loc='best')
plt.title('Time Series Decomposition of Daily Sales for Top Performing Store')

# Plot the trend component
plt.subplot(412)
plt.plot(decomposition.trend.values, label='Trend')
plt.legend(loc='best')

# Plot the seasonal component
plt.subplot(413)
plt.plot(decomposition.seasonal.values, label='Seasonal')
plt.legend(loc='best')

# Plot the residual component
plt.subplot(414)
plt.plot(decomposition.resid.values, label='Residual')
plt.legend(loc='best')

plt.tight_layout()

plt.show()

We can see a gradual upward trend in the total-daily sales. We have a weekly cycle since we chose period=7. We can see some spikes in the residual indicating some noise present which needs to be addressed.

7.2 Autocorrelation for Top Store¶

In [ ]:
plt.figure(figsize=(14, 7))

# Generate an autocorrelation plot for 'total_daily_sales' of the top-performing store
pd.plotting.autocorrelation_plot(top_store_data['total_daily_sales'])

# Set the title for the autocorrelation plot
plt.title('Autocorrelation Plot for the Top Store')

# Display the autocorrelation plot
plt.show()

The autocorrelation plot shown above is for the top performing store with the highest first-year sales. The plot shows how a value is correlated with its past values.

The blue region represents the confidence interval. Correlation values outside this region are statistically significant. The plot indicates some level of positive autocorrelation for lag values between 0 and approximately 20 days. Beyond 20 days, the autocorrelation values fall within the confidence interval, suggesting that sales beyond this point are not significantly correlated with the current day's sales.

7.3 Stationarity test¶

In [ ]:
from statsmodels.tsa.stattools import adfuller

# Perform the Augmented Dickey-Fuller (ADF) test on the 'total_daily_sales' of the top-performing store
result = adfuller(top_store_data['total_daily_sales'].dropna())

# Print the statistics
print('ADF Statistic:', result[0])
print('p-value:', result[1])
print('Critical Values:', result[4])
ADF Statistic: -3.0898204481906437
p-value: 0.027315750275665577
Critical Values: {'1%': -3.44911857009962, '5%': -2.8698097654570507, '10%': -2.5711757061225153}

ADF Statistic: -3.0898204481906437 The value of -3.089 suggests that the series is likely stationary.

p-value: 0.027315750275665577 The p-value represents the probability that the null hypothesis is true. The p-value is 0.0273, which is less than 0.05, reinforcing the idea that the series is stationary.

Critical Values:

1%: -3.44911857009962 5%: -2.8698097654570507 10%: -2.5711757061225153 In this case, the ADF statistic of -3.506 is less than the value of -3.449 at the 1% confidence level. This implies that we can be 99% confident that the series is stationary.

7.4 Time Series Decomposition of Top 5 Stores with the highest First-Year Non-fuel Sales¶

In [ ]:
from statsmodels.tsa.seasonal import seasonal_decompose
import matplotlib.dates as mdates

# Get the IDs of the top 5 stores based on first-year sales
top_5_stores = first_year_sales_sorted.head(5).index.tolist()

# Filter the merged_df to only include data for the top 5 stores
top_5_stores_data = merged_df[merged_df['site_id_msba'].isin(top_5_stores)]

fig, axes = plt.subplots(5, 3, figsize=(18, 15))

# Loop through each of the top 5 stores
for i, store in enumerate(top_5_stores):
    # Filter data for the current store
    store_data = top_5_stores_data[top_5_stores_data['site_id_msba'] == store]
    # Set the date as the DataFrame index and sort by date
    store_data = store_data.set_index('calendar.calendar_day_date')
    store_data.sort_index(inplace=True)

    # Perform seasonal decomposition with a weekly period (7 days)
    decomposition = seasonal_decompose(store_data['total_daily_sales'], period=7)

    # Plot and configure the Observed component for the current store
    axes[i, 0].plot(decomposition.observed)
    axes[i, 0].set_title(f"Store {store} - Observed")
    axes[i, 0].xaxis.set_major_locator(mdates.MonthLocator())
    axes[i, 0].xaxis.set_major_formatter(mdates.DateFormatter("%b '%y"))
    axes[i, 0].tick_params(axis='x', rotation=45)

    # Plot and configure the Trend component for the current store
    axes[i, 1].plot(decomposition.trend)
    axes[i, 1].set_title(f"Store {store} - Trend")
    axes[i, 1].xaxis.set_major_locator(mdates.MonthLocator())
    axes[i, 1].xaxis.set_major_formatter(mdates.DateFormatter("%b '%y"))
    axes[i, 1].tick_params(axis='x', rotation=45)

    # Plot and configure the Residual component for the current store
    axes[i, 2].plot(decomposition.resid)
    axes[i, 2].set_title(f"Store {store} - Residual")
    axes[i, 2].xaxis.set_major_locator(mdates.MonthLocator())
    axes[i, 2].xaxis.set_major_formatter(mdates.DateFormatter("%b '%y"))
    axes[i, 2].tick_params(axis='x', rotation=45)

plt.tight_layout()
plt.show()

We can compare the trend, seasonality and noise in the top 5 performing stores with non-fuel sales. We can see a gradual increase in trend in most of them, with heavy spikes in residuals ins some of the stores, usually occuring during the end of year.

8. Top Performing Store (First-Year Fuel Sales)¶

8.1 Time-Series Decomposition for the Top-performing store¶

In [ ]:
# Find the store ID with the highest first-year fuel sales
top_store_id_fuel = first_year_fuel_sales_sorted.index[0]

# Filter the data for the top fuel store from the 'df_tsdm' DataFrame
top_store_fuel_data = df_tsdm[df_tsdm['site_id_msba'] == top_store_id]

# Add a new column 'total_daily_sales' to 'top_store_fuel_data'
top_store_fuel_data['total_daily_sales'] = merged_df[merged_df['site_id_msba'] == top_store_id_fuel]['total_daily_fuel_sales'].tolist()

# Sort 'top_store_fuel_data' by date and reset its index
top_store_fuel_data = top_store_fuel_data.sort_values('calendar.calendar_day_date').reset_index(drop=True)

# Set the date column as the DataFrame's index
top_store_fuel_data.set_index('calendar.calendar_day_date', inplace=True)

# Convert the index to a Pandas DatetimeIndex with daily frequency
top_store_fuel_data.index = pd.DatetimeIndex(top_store_fuel_data.index).to_period('D')

# Set the DataFrame's frequency to daily and fill any missing values with 0
top_store_fuel_data = top_store_fuel_data.asfreq('D').fillna(0)

top_store_fuel_data.head()
<ipython-input-432-0541b8d47253>:8: SettingWithCopyWarning:


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy

Out[ ]:
Unnamed: 0 capital_projects.soft_opening_date calendar.fiscal_week_id_for_year calendar.day_of_week calendar_information.holiday calendar_information.type_of_day daily_yoy_ndt.total_inside_sales daily_yoy_ndt.total_food_service diesel unleaded ... Year Month Fiscal_Year Fiscal_Week year month day quarter day_of_week is_weekend
calendar.calendar_day_date
2021-02-02 12616 2021-02-02 5 1 NONE WEEKDAY 3085.4005 1195.7400 765.4710 1044.7430 ... 2021 2021-02-02 2021 05 2021 2 2 1 1 0
2021-02-03 12716 2021-02-02 5 2 NONE WEEKDAY 3567.2945 1357.3525 1419.1380 1460.5675 ... 2021 2021-02-03 2021 05 2021 2 3 1 2 0
2021-02-04 12534 2021-02-02 5 3 NONE WEEKDAY 4141.1230 1606.9690 1111.4215 1723.8865 ... 2021 2021-02-04 2021 05 2021 2 4 1 3 0
2021-02-05 12617 2021-02-02 6 4 NONE WEEKDAY 3795.4245 1455.3980 1292.8895 1662.8885 ... 2021 2021-02-05 2021 05 2021 2 5 1 4 0
2021-02-06 12717 2021-02-02 6 5 NONE WEEKEND 1883.9975 592.1895 808.9200 1043.1820 ... 2021 2021-02-06 2021 05 2021 2 6 1 5 1

5 rows × 24 columns

In [ ]:
# Perform seasonal decomposition on 'total_daily_fuel_sales' of the top-performing fuel store
decomposition = seasonal_decompose(top_store_fuel_data['total_daily_fuel_sales'], model='additive', period=7)

plt.figure(figsize=(14, 8))

# Plot the observed (original) values
plt.subplot(411)
plt.plot(decomposition.observed.values, label='Original')
plt.legend(loc='best')
plt.title('Time Series Decomposition of Daily Fuel Sales for Top Performing Store')

# Plot the trend component
plt.subplot(412)
plt.plot(decomposition.trend.values, label='Trend')
plt.legend(loc='best')

# Plot the seasonal component
plt.subplot(413)
plt.plot(decomposition.seasonal.values, label='Seasonal')
plt.legend(loc='best')

# Plot the residual component
plt.subplot(414)
plt.plot(decomposition.resid.values, label='Residual')
plt.legend(loc='best')

plt.tight_layout()
plt.show()

The time-series decomposition for the store with highest gallons of fuel sold seems to be similar to the time-series decomposition for the store with highest sales in dollar. The trend is gradually increasing with a weekly seasonality and some spike in noise.

8.2 Time Series Decomposition of Top 5 Stores with the highest First-Year Fuel Sales¶

In [ ]:
from statsmodels.tsa.seasonal import seasonal_decompose
import matplotlib.dates as mdates

# Get the IDs of the top 5 stores based on first-year fuel sales
top_5_fuel_stores = first_year_fuel_sales_sorted.head(5).index.tolist()

# Filter merged_df to only include data for the top 5 fuel-selling stores
top_5_fuel_stores_data = merged_df[merged_df['site_id_msba'].isin(top_5_fuel_stores)]

fig, axes = plt.subplots(5, 3, figsize=(18, 15))

# Loop through each of the top 5 fuel-selling stores
for i, store in enumerate(top_5_fuel_stores):
    # Filter data for the current store
    store_data = top_5_fuel_stores_data[top_5_fuel_stores_data['site_id_msba'] == store]
    # Set the date as DataFrame index and sort by date
    store_data = store_data.set_index('calendar.calendar_day_date')
    store_data.sort_index(inplace=True)

    # Perform seasonal decomposition with a weekly period (7 days)
    decomposition = seasonal_decompose(store_data['total_daily_sales'], period=7)

    # Plot and configure the Observed component for the current store
    axes[i, 0].plot(decomposition.observed)
    axes[i, 0].set_title(f"Store {store} - Observed")
    axes[i, 0].xaxis.set_major_locator(mdates.MonthLocator())
    axes[i, 0].xaxis.set_major_formatter(mdates.DateFormatter("%b '%y"))
    axes[i, 0].tick_params(axis='x', rotation=45)

    # Plot and configure the Trend component for the current store
    axes[i, 1].plot(decomposition.trend)
    axes[i, 1].set_title(f"Store {store} - Trend")
    axes[i, 1].xaxis.set_major_locator(mdates.MonthLocator())
    axes[i, 1].xaxis.set_major_formatter(mdates.DateFormatter("%b '%y"))
    axes[i, 1].tick_params(axis='x', rotation=45)

    # Plot and configure the Residual component for the current store
    axes[i, 2].plot(decomposition.resid)
    axes[i, 2].set_title(f"Store {store} - Residual")
    axes[i, 2].xaxis.set_major_locator(mdates.MonthLocator())
    axes[i, 2].xaxis.set_major_formatter(mdates.DateFormatter("%b '%y"))
    axes[i, 2].tick_params(axis='x', rotation=45)

plt.tight_layout()
plt.show()

We can compare the trend, seasonality and noise in the top 5 performing stores of fuel sales. We can see a gradual increase in trend in most of them, with heavy spikes in residuals in some of the stores, usually occuring during the end of year. The dips in trend can possibly due to the holiday season at the end-of-year when people prefer being indoors and there is a low vehicle turnout.

9. Questions¶

  1. The qualitative dataset had some missing values in the categorical columns. We have imputed then by creating a new category 'Not present' indicating that the feature is not available at the store. Is this the best way to deal with the null values? Should we use other imputation methods like mode, median?

  2. Most of the distributions we visualized were skewed indicating the presence of outliers. What method would be best for dealing with outliers, Winsorization or using z-score to remove outliers?

  3. Some columns are identical in the Qualitative Dataset. What needs to be done with those columns?

  4. Some columns are having values with no variation for all the sites. Do we need to remove those while modeling as their contribution to the analysis is insignificant ?

  5. Qualitative data provided to 37 sites but Time series data provided to 38 sites. Will we be getting the data for 38th site for Modeling ?

10. Results¶

Throughout our exploration, we focused on understanding how various features correlate with the sales features, to aid in forecasting first-year sales for Maverik's new stores. We began by examining the qualitative and time-series datasets separately, performing EDA that included missing value treatment, outlier detection, and basic visualizations like histograms and bar charts. Sales-related features ('daily_yoy_ndt.total_inside_sales', 'daily_yoy_ndt.total_food_service', 'diesel', and 'unleaded') generally exhibited right-skewed distributions, as did store attributes like 'square_feet' and 'parking_spaces'. This skewness suggests that most stores have lower sales and smaller sizes. When examining the impact of holidays on 'total_sales', we found significant variations. For instance, 'Good Friday' showed the highest average sales, while 'New Year's Day' had the lowest. This variation in sales across holidays suggests that special events can significantly influence customer purchasing behavior. Time-series decomposition showed discernible seasonal patterns, particularly gradual increase in sales from Monday to Friday and then a dip on the Weekends. We also found a heavy dive in sales during the end of year. During 2nd,3rd quarters of Year 2022, sales were found to be highest with June-2022 is the month with highest sales.

11. Contributions¶

11.1. Daryle Bilog¶

I created visualizations to determine the revenue generating columns on our datasets by merging them together using inner merge. I also found out that there are several columns that has zero variations which might not be as useful on creating a model because they don't have discriminative power. Removing these columns can also increase computational effeciency of our model. I added questions for our group to ponder while doing this project.

11.2. Joe Sarnello¶

I did EDA on the qualitative and time series datasets indivually observing which sites, which days of the week, and what time of the year have the highest total sales. I also added some questions at the top that I was trying to answer while doing EDA.

11.3. Sanskriti Bhargava¶

I worked on visualizing and interpreting the distribution of various features. I further analysed the First-Year sales and the top performing stores. I also visualized and interpreted the patterns that emerged in the analysis of First-Year sales and sales of the top performing stores. This included feature visualizations, Correlation analysis, feature and target distribution analysis, Time Series Decomposition (trend, seasonality, residuals), Autocorrelation analysis, Stationarity test and Decomposition comparison of top 5 stores. I also analyzed and visualized the effect of holidays on sales.

11.4. Vinay Kumar Vascuri¶

I worked on the EDA of Time series dataset ,Qualitative dataset and Merged datasets. I analysed the Descriptive statistics of TimeSeries dataset which included Average Fuel & Non-Fuel sales, Correlation matrix of Target Variables with other variables & Feature engineering. Contributed to Introduction,Questions & Results area.